The vast majority of AI systems in production rely on basic semantic search to provide context: a single retrieval call into a vector database powers most Retrieval-Augmented Generation systems today. If you’ve tried building on a setup like this, you know exactly how limited it is at truly understanding your data.

I went through thousands of real user queries to classify exactly where and when semantic search breaks down and returns missing or irrelevant results.

I pattern-matched dozens of failure modes. Here are three of them. If you want to hear more, you can reach me at pipitone@zeroentropy.dev

1. Negated Semantic Queries: “Which electric vehicle articles do not include any reference to Elon Musk?”

Both keyword and semantic search will immediately retrieve precisely the electric vehicle articles that do include a reference to Elon Musk, because the query representation is dominated by the entities the query mentions, not by the negation wrapped around them.

2. Multi-Hop Queries: “If the acquiring company fails to hold a shareholder’s meeting, what is the penalty?”

To answer this query, you need to work step by step. First you would find the paragraph that says what happens when the shareholder meeting is not held. Suppose that search reveals the agreement will be terminated in that circumstance. Then you must search for what penalties are incurred by terminating the agreement. A simple semantic search will return paragraphs about shareholder meetings, and it will also return paragraphs about every kind of penalty, but it will fail to link the two and realize that the “termination penalty” specifically should be ranked first.

Multi-hop queries require multiple rounds of retrieval to get to the right information.

3. Fuzzy Filtering Queries: “What diagnostic methods are suggested for early-stage cancer, in papers with a sample size of over 2000”

Sample sizes usually appear in the first paragraph of a medical research article, while the specific diagnostic method is likely mentioned deep in the article, so the two pieces of information rarely end up in the same chunk. Your RAG pipeline will happily show diagnostic methods for early-stage cancer from articles that do not match the requested sample size. Worse, the correct answer becomes almost impossible to find if “over 2000” is a rare filter.

Rough sketches of workarounds for each of these three failure modes are included at the end of this post.

----

Another interesting topic is evals for retrieval. At this point I've talked to hundreds of developers and found that retrieval evaluation is often overlooked, despite its impact on an AI’s intelligence and hallucination rate.

In most cases, evaluation happens at the end-user stage, through direct feedback mechanisms like thumbs-up/down ratings. However, few teams have a way of tracing a “thumbs down” back to exactly what went wrong and where. Was it a UX problem? An LLM hallucination? Did the retrieval pipeline fail, or did the corpus simply lack the correct information? Today these questions are typically answered by manually reviewing queries, a process that is labor-intensive, inconsistent, and impractical at scale.

Yet evaluating retrieval is a key step in building a useful and reliable AI product, and doing it well is hard. LLM evaluations only require an (input, output) pair. Retrieval benchmarks require the query, a snapshot of the entire corpus as it existed at that exact point in time, and ground-truth citations for exactly what the correct retrieval results should have been.

Building such a benchmark is super hard.
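Before wrapping up, here are the workaround sketches promised above for the three failure modes. They are illustrative Python, not drop-in code: the retriever and LLM calls are stand-in functions, and every name I introduce (search_with_negation, semantic_search, and so on) is hypothetical. For negated queries, one common workaround is to have an LLM pull the negated entity out of the query, then enforce it as a lexical exclusion filter over an over-fetched candidate set:

    from typing import Callable

    # (doc_id, text, similarity_score), standing in for whatever your vector DB returns
    Doc = tuple[str, str, float]

    def search_with_negation(
        query: str,
        excluded_terms: list[str],
        semantic_search: Callable[[str, int], list[Doc]],
        k: int = 10,
        overfetch: int = 5,
    ) -> list[Doc]:
        # Over-fetch, because the exclusion filter will discard candidates.
        candidates = semantic_search(query, k * overfetch)
        # Drop any candidate chunk that mentions an excluded term. In practice you
        # would check the whole parent document, not just the retrieved chunk.
        kept = [
            (doc_id, text, score)
            for doc_id, text, score in candidates
            if not any(term.lower() in text.lower() for term in excluded_terms)
        ]
        return kept[:k]

    # e.g. search_with_negation("electric vehicle articles", ["Elon Musk"], my_search)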
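For multi-hop queries, the usual fix is an iterative loop: retrieve, let an LLM decide whether the evidence is sufficient or what to search for next, and repeat. A minimal sketch, again with semantic_search and call_llm as placeholders for your retriever and LLM client:

    from typing import Callable

    def multi_hop_retrieve(
        question: str,
        semantic_search: Callable[[str, int], list[str]],
        call_llm: Callable[[str], str],
        max_hops: int = 3,
        k: int = 5,
    ) -> list[str]:
        query = question
        evidence: list[str] = []
        for _ in range(max_hops):
            evidence.extend(semantic_search(query, k))
            # Ask the LLM whether the evidence answers the question, or what to
            # search for next (e.g. "termination penalty" after the first hop).
            evidence_text = "\n".join(evidence)
            prompt = (
                f"Question: {question}\n"
                f"Evidence so far:\n{evidence_text}\n"
                "If the evidence answers the question, reply DONE. "
                "Otherwise reply with the next search query to run."
            )
            next_query = call_llm(prompt).strip()
            if next_query.upper().startswith("DONE"):
                break
            query = next_query
        return evidence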
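For fuzzy filtering queries, the workaround is to extract structured metadata (like sample size) once at indexing time, attach it to every chunk of the parent document, and apply it as a hard filter next to the semantic query. The regex extractor below is a toy; in practice you would likely use an LLM extraction pass, and most vector stores can apply the metadata filter server-side. At query time an LLM splits the question into the semantic part (“diagnostic methods for early-stage cancer”) and the constraint (sample size over 2000):

    import re

    def extract_sample_size(article_text: str) -> int | None:
        # Toy heuristic: matches "n = 2,340", "sample size of 2340", etc.
        m = re.search(r"(?:n\s*=\s*|sample size of\s+)([\d,]+)", article_text, re.IGNORECASE)
        return int(m.group(1).replace(",", "")) if m else None

    def index_chunk(chunk_text: str, parent_article_text: str) -> dict:
        # Attach document-level metadata to every chunk, so a chunk about
        # diagnostic methods "knows" the sample size stated pages earlier.
        return {
            "text": chunk_text,
            "metadata": {"sample_size": extract_sample_size(parent_article_text)},
        }

    def filter_candidates(candidates: list[dict], min_sample_size: int) -> list[dict]:
        # Keep only chunks whose parent article reports a sample size over the threshold.
        return [
            c for c in candidates
            if (c["metadata"].get("sample_size") or 0) > min_sample_size
        ]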
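On the eval side, here is roughly the shape of a benchmark record and the deterministic metrics computed from it. The field names are mine, not a spec; the point is that each record has to pin the query, the corpus snapshot, and the ground-truth citations together:

    from dataclasses import dataclass

    @dataclass
    class BenchmarkExample:
        query: str
        corpus_snapshot_id: str     # pins the corpus exactly as it existed at query time
        relevant_doc_ids: set[str]  # ground-truth citations for the correct results

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the ground-truth documents that appear in the top k results.
        if not relevant:
            return 0.0
        return len(set(retrieved[:k]) & relevant) / len(relevant)

    def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
        # 1 / rank of the first relevant result; 0 if none was retrieved.
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

Averaging these per-query values over all records gives recall@k and MRR for a given retrieval pipeline, and the numbers stay reproducible because each record pins the corpus snapshot it was built against.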
As hard as building such a benchmark by hand is, I strongly believe LLMs can and should be used to autonomously define and build them, so that deterministic metrics like recall, precision, and mean reciprocal rank can be computed.

That’s why I am currently building an open-source benchmark creation framework that I will release soon. If you’d like to contribute, or if evaluation is something you’re curious about, feel free to reach out to me at pipitone@zeroentropy.dev