The vast majority of AI systems in production rely on basic semantic search to provide context: a single retrieval call into a vector database powers most Retrieval-Augmented Generation systems today. If you’ve tried building on a setup like this, you know exactly how limited it is at truly understanding your data.

I went through thousands of real user queries to classify exactly where and when semantic search breaks down and returns missing or irrelevant results.

I pattern-matched dozens of failure modes. Here are three of them. If you want to hear more, you can reach me at pipitone@zeroentropy.dev

1. Negated Semantic Queries: “Which electric vehicle articles do not include any reference to Elon Musk?”

Both keyword and semantic search will immediately retrieve precisely the electric vehicle articles that do include a reference to Elon Musk, because the query representation is dominated by the entities the query mentions, not by the negation wrapped around them.

2. Multi-Hop Queries: “If the acquiring company fails to hold a shareholder’s meeting, what is the penalty?”

To answer this query, you need to work step by step. First you would find the paragraph that says what happens when the shareholder meeting is not held. Suppose that search reveals the agreement will be terminated in that circumstance. Then you must search for what penalties are incurred by terminating the agreement. A simple semantic search will return paragraphs about shareholder meetings, and it will also return paragraphs about every kind of penalty, but it will fail to link the two and realize that the “termination penalty” specifically should be ranked first.

Multi-hop queries require multiple rounds of retrieval to get to the right information.

3. Fuzzy Filtering Queries: “What diagnostic methods are suggested for early-stage cancer, in papers with a sample size of over 2000”

Sample sizes usually appear in the first paragraph of a medical research article, while the specific diagnostic method is likely mentioned deep in the article, so the two pieces of information rarely end up in the same chunk. Your RAG pipeline will happily show diagnostic methods for early-stage cancer from articles that do not match the requested sample size. Worse, the correct answer becomes almost impossible to find if “over 2000” is a rare filter.

Rough sketches of workarounds for each of these three failure modes are included at the end of this post.

----

Another interesting topic is evals for retrieval. At this point I've talked to hundreds of developers and found that retrieval evaluation is often overlooked, despite its impact on an AI’s intelligence and hallucination rate.

In most cases, evaluation happens at the end-user stage, through direct feedback mechanisms like thumbs-up/down ratings. However, few teams have a way of tracing a “thumbs down” back to exactly what went wrong and where. Was it a UX problem? An LLM hallucination? Did the retrieval pipeline fail, or did the corpus simply lack the correct information? Today these questions are typically answered by manually reviewing queries, a process that is labor-intensive, inconsistent, and impractical at scale.

Yet evaluating retrieval is a key step in building a useful and reliable AI product, and doing it well is hard. LLM evaluations only require an (input, output) pair. Retrieval benchmarks require the query, a snapshot of the entire corpus as it existed at that exact point in time, and ground-truth citations for exactly what the correct retrieval results should have been.

Building such a benchmark is super hard.
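Before wrapping up, here are the workaround sketches promised above for the three failure modes. They are illustrative Python, not drop-in code: the retriever and LLM calls are stand-in functions, and every name I introduce (search_with_negation, semantic_search, and so on) is hypothetical. For negated queries, one common workaround is to have an LLM pull the negated entity out of the query, then enforce it as a lexical exclusion filter over an over-fetched candidate set:

    from typing import Callable

    # (doc_id, text, similarity_score), standing in for whatever your vector DB returns
    Doc = tuple[str, str, float]

    def search_with_negation(
        query: str,
        excluded_terms: list[str],
        semantic_search: Callable[[str, int], list[Doc]],
        k: int = 10,
        overfetch: int = 5,
    ) -> list[Doc]:
        # Over-fetch, because the exclusion filter will discard candidates.
        candidates = semantic_search(query, k * overfetch)
        # Drop any candidate chunk that mentions an excluded term. In practice you
        # would check the whole parent document, not just the retrieved chunk.
        kept = [
            (doc_id, text, score)
            for doc_id, text, score in candidates
            if not any(term.lower() in text.lower() for term in excluded_terms)
        ]
        return kept[:k]

    # e.g. search_with_negation("electric vehicle articles", ["Elon Musk"], my_search)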
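For multi-hop queries, the usual fix is an iterative loop: retrieve, let an LLM decide whether the evidence is sufficient or what to search for next, and repeat. A minimal sketch, again with semantic_search and call_llm as placeholders for your retriever and LLM client:

    from typing import Callable

    def multi_hop_retrieve(
        question: str,
        semantic_search: Callable[[str, int], list[str]],
        call_llm: Callable[[str], str],
        max_hops: int = 3,
        k: int = 5,
    ) -> list[str]:
        query = question
        evidence: list[str] = []
        for _ in range(max_hops):
            evidence.extend(semantic_search(query, k))
            # Ask the LLM whether the evidence answers the question, or what to
            # search for next (e.g. "termination penalty" after the first hop).
            evidence_text = "\n".join(evidence)
            prompt = (
                f"Question: {question}\n"
                f"Evidence so far:\n{evidence_text}\n"
                "If the evidence answers the question, reply DONE. "
                "Otherwise reply with the next search query to run."
            )
            next_query = call_llm(prompt).strip()
            if next_query.upper().startswith("DONE"):
                break
            query = next_query
        return evidence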
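For fuzzy filtering queries, the workaround is to extract structured metadata (like sample size) once at indexing time, attach it to every chunk of the parent document, and apply it as a hard filter next to the semantic query. The regex extractor below is a toy; in practice you would likely use an LLM extraction pass, and most vector stores can apply the metadata filter server-side. At query time an LLM splits the question into the semantic part (“diagnostic methods for early-stage cancer”) and the constraint (sample size over 2000):

    import re

    def extract_sample_size(article_text: str) -> int | None:
        # Toy heuristic: matches "n = 2,340", "sample size of 2340", etc.
        m = re.search(r"(?:n\s*=\s*|sample size of\s+)([\d,]+)", article_text, re.IGNORECASE)
        return int(m.group(1).replace(",", "")) if m else None

    def index_chunk(chunk_text: str, parent_article_text: str) -> dict:
        # Attach document-level metadata to every chunk, so a chunk about
        # diagnostic methods "knows" the sample size stated pages earlier.
        return {
            "text": chunk_text,
            "metadata": {"sample_size": extract_sample_size(parent_article_text)},
        }

    def filter_candidates(candidates: list[dict], min_sample_size: int) -> list[dict]:
        # Keep only chunks whose parent article reports a sample size over the threshold.
        return [
            c for c in candidates
            if (c["metadata"].get("sample_size") or 0) > min_sample_size
        ]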
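On the eval side, here is roughly the shape of a benchmark record and the deterministic metrics computed from it. The field names are mine, not a spec; the point is that each record has to pin the query, the corpus snapshot, and the ground-truth citations together:

    from dataclasses import dataclass

    @dataclass
    class BenchmarkExample:
        query: str
        corpus_snapshot_id: str     # pins the corpus exactly as it existed at query time
        relevant_doc_ids: set[str]  # ground-truth citations for the correct results

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the ground-truth documents that appear in the top k results.
        if not relevant:
            return 0.0
        return len(set(retrieved[:k]) & relevant) / len(relevant)

    def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
        # 1 / rank of the first relevant result; 0 if none was retrieved.
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

Averaging these per-query values over all records gives recall@k and MRR for a given retrieval pipeline, and the numbers stay reproducible because each record pins the corpus snapshot it was built against.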
As hard as building such a benchmark by hand is, I strongly believe LLMs can and should be used to autonomously define and build them, so that deterministic metrics like recall, precision, and mean reciprocal rank can be computed.

That’s why I am currently building an open-source benchmark creation framework that I will release soon. If you’d like to contribute, or if evaluation is something you’re curious about, feel free to reach out to me at pipitone@zeroentropy.dev