
Ask HN: Bad Results from Vector Search?

2 points by tuckerconnelly, over 1 year ago
Hi, I'm building a RAG system with OpenAI embeddings and the ChatGPT API.

I chunked all the documents into 400-800 character chunks, vectorized them all, and put them in a vector database.

The results are pretty bad: the surfaced document chunks kind-of-but-not-really match up with the query.

I'm getting much better results from simple keyword searches (using Meilisearch).

Am I doing something wrong? Do I need to use a fine-tuned model like BERT? Is this technology vastly overhyped?
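For reference, the pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration, not the poster's actual code: `embed` is a bag-of-words stand-in for a real embedding model (a real system would call an embedding API here), and `chunk` uses the fixed character sizes mentioned in the post.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: a sparse
    # bag-of-words count vector keyed by lowercased tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text, size=400):
    # Fixed-size character chunks, as described in the post.
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_k(query, chunks, k=3):
    # Rank chunks by similarity to the query and return the best k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

The quality of what `top_k` surfaces depends almost entirely on what `embed` and `chunk` do, which is where the replies below focus.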

3 comments

PaulHoule, over 1 year ago
I've used https://sbert.net/

My take is you might like keyword search better for some queries and embedding search better for others.

Two problems are both hard: (1) how to combine keyword search and embedding search (you'd imagine you'd want a ranking function that handles both), and (2) how to handle chunks.

As for (2), you probably want to make the chunks as big as you practically can, and you should be chunking on tokens instead of characters if you at all can.

With chunks, of course, you don't get a score for the query-document relationship; you get a query-chunk score instead, which isn't *quite* the score you really want. Aggregating all the chunk hits and properly chunking the data is an open problem, to say the least.
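Both problems the comment names have common, if imperfect, workarounds; the commenter doesn't endorse either specifically, so the following is just one illustrative sketch in plain Python. Reciprocal rank fusion (RRF) is a standard way to merge a keyword ranking and an embedding ranking without needing their raw scores to be comparable, and scoring a document by its best chunk is one simple aggregation choice.

```python
def rrf(rankings, k=60):
    # Reciprocal Rank Fusion: merge several ranked lists of doc IDs
    # (e.g. one from keyword search, one from vector search) into a
    # single ranking; each list contributes 1/(k + rank) per doc.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def doc_scores(chunk_hits, chunk_to_doc):
    # Turn query-chunk scores into query-document scores by keeping
    # each document's best chunk (one simple choice among many).
    best = {}
    for chunk_id, score in chunk_hits:
        doc = chunk_to_doc[chunk_id]
        best[doc] = max(best.get(doc, 0.0), score)
    return best
```

A document that appears near the top of both input rankings floats to the top of the fused list, which is the behavior a combined ranking function is after.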
Comment #37786597 not loaded
simonmesmith, over 1 year ago
From some experience I've had with this:

* Is that the right chunk size? How much of a chunk might contain the relevant information? Is it better for your use case to chunk by sentence? I've done RAG with document chunks, sentences, and triplets (source -> relation -> target). How you chunk can have a big impact.

* One approach that I've seen work very well is to (1) first use keyword or entity search to limit results, then (2) use semantic similarity to the query to rank those results. This is how, for example, they do it at LitSense for sentences from scientific papers: https://www.ncbi.nlm.nih.gov/research/litsense/. Paper here: https://academic.oup.com/nar/article/47/W1/W594/5479473.

* You still need metadata. For example, if a user asks for something like "show me new information about X," the concept of "new" won't get embedded in the text. You'll need to convert that to some kind of date search. This is where doing RAG with something like OpenAI function calls can be great: the model can see "new" and use it to pass a date to a date filter.

* I've found some embeddings can be frustrating because they conflate things that can even be opposites. For example, "increase" and "decrease" might show up as similar because they both get mapped into the space for "direction." This probably isn't an issue with better (I assume higher-dimensional) embeddings, but it can be problematic with some embeddings.

* You might need specialized domain embeddings for a very specific domain, such as law, finance, or biology. Certain words or concepts that are very specific to a domain might not be properly captured in a general embedding space. A "knockout" means something very different in sports, when talking about an attractive person, or in biology, where it refers to genetic manipulation.
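The two-stage approach in the second bullet (keyword filter first, semantic rank second) can be sketched generically. This is a hypothetical helper, not LitSense's implementation, and `overlap` is a toy stand-in for the real semantic score, which in practice would be embedding cosine similarity:

```python
def filter_then_rank(query, docs, keywords, score):
    # Stage 1: a cheap keyword/entity filter narrows the candidate set.
    candidates = [d for d in docs if any(k in d.lower() for k in keywords)]
    # Stage 2: a semantic score orders only the survivors, so the
    # expensive similarity step runs over a much smaller set.
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)

def overlap(query, doc):
    # Toy semantic score for illustration: Jaccard word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)
```

The keyword stage also sidesteps the "knockout" problem from the last bullet: only documents that actually contain the filter term are ever ranked, so the semantic stage can't surface a boxing article for a genetics query unless it passed the lexical filter.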
catlover76, over 1 year ago
Yes, I too have experienced this pain point. One thing to try is to embed using something other than OpenAI ada.