We're building a corporate RAG for a government entity. What I've learned so far by applying an experimental A/B testing approach to RAG using RAGAS metrics:

- Hybrid retrieval (semantic + vector) followed by LLM-based reranking made no significant change when measured on synthetic eval questions

- HyDE severely decreased answer quality and retrieval quality when measured with RAGAS on synthetic eval questions

(We still have to run a RAGAS eval using expert and real user questions.)

So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But one method doesn't always win. We found the semantic search of Azure AI Search sufficient as a second method, next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. It depends on the use case. Test, test, test.

Next things we're going to try:

- RAPTOR

- Self-RAG

- Agentic RAG

- Query refinement (expansion and sub-queries)

- GraphRAG

Learnings so far:

- Always compare an experiment against a baseline and try to refute your null hypothesis using measures like RAGAS or others (a minimal sketch of such a run follows below)

- Use three types of evaluation questions/answers: 1. expert-written Q&A, 2. real user questions (from logs), 3. synthetic Q&A generated from your source documents
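As a rough illustration of that baseline-vs-experiment loop, here is a minimal sketch using the ragas library (assumptions: the ragas and datasets packages are installed, metric imports and column names vary between ragas versions, and `baseline_rows` / `experiment_rows` stand in for your own eval data):

```python
# Minimal baseline-vs-experiment RAGAS sketch; column names and metric imports
# may differ between ragas versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

def score_pipeline(rows: list[dict]):
    # rows: one dict per eval question, holding the pipeline's answer,
    # its retrieved contexts, and a reference ("ground truth") answer
    ds = Dataset.from_list([
        {
            "question": r["question"],
            "answer": r["answer"],
            "contexts": r["contexts"],          # list[str] of retrieved chunks
            "ground_truth": r["ground_truth"],  # expert / synthetic reference answer
        }
        for r in rows
    ])
    return evaluate(ds, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])

# Run the same eval set through both pipelines, then compare the deltas
# before declaring the experiment a win.
baseline_scores = score_pipeline(baseline_rows)      # hypothetical data
experiment_scores = score_pipeline(experiment_rows)  # hypothetical data
```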
My favorite thing about this is the way it takes advantage of prompt caching.

That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made sense financially.

I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.

My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual-retrieval/ and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-claude/
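For reference, this is roughly what the cached "chunk vs. whole document" call looks like with the Anthropic Python SDK (a sketch only: the beta header and model name reflect the prompt-caching beta as of writing, and the situating prompt is paraphrased rather than copied from the cookbook):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

full_document = open("report.txt").read()
chunk = "...one chunk pulled from the document..."

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=300,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {"type": "text",
         "text": "You situate chunks within their source document."},
        # The whole document is cached once and reused for every chunk,
        # which is what makes running every chunk against it affordable.
        {"type": "text",
         "text": f"<document>{full_document}</document>",
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{
        "role": "user",
        "content": (f"Here is a chunk:\n<chunk>{chunk}</chunk>\n"
                    "Write a short context that situates this chunk "
                    "within the overall document."),
    }],
)
print(response.content[0].text)
```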
To add some context, this isn't that novel an approach. A common way to improve RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can further improve your results by running query expansion using HyDE[1], though it's not always an improvement. I use it as a fallback.

I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context; there's no actual change to their API regarding "contextual retrieval".

The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.

Other than that, though, the only thing I see introduced is a cookbook for a particular RAG workflow.

As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, and unlike anything else provided by other providers. I highly recommend it.

[1]: https://arxiv.org/abs/2212.10496
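For anyone who hasn't tried HyDE: the whole trick is to embed a hypothetical answer instead of the raw query. A hedged sketch (the OpenAI client and model names are just examples, and `index.search` stands in for whatever vector store you already use):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def hyde_search(query: str, index, top_k: int = 5):
    # 1. Ask an LLM to write a hypothetical passage that would answer the query.
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical passage instead of the raw query.
    emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=hypothetical,
    ).data[0].embedding

    # 3. Search the existing vector index with that embedding
    #    (index.search is a placeholder for your own store's API).
    return index.search(np.array(emb), top_k=top_k)
```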
We're doing something similar. We first chunk the documents based on h1, h2, h3 headings. Then we add the headers at the beginning of the chunk as context. As an imaginary example, instead of one chunk being:

```
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```

It is now something like:

```
# Fever
## Treatment
---
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```

This seems to work pretty well, and doesn't require any LLMs when indexing documents.

(Edited formatting)
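A rough sketch of that heading-aware chunking for markdown-style sources (assumptions: h1-h3 map to `#`/`##`/`###`, and the splitting rules are deliberately simplified):

```python
import re

def chunk_with_headings(markdown_text: str) -> list[str]:
    chunks: list[str] = []
    path: dict[int, str] = {}   # heading level -> current heading text
    body_lines: list[str] = []

    def flush():
        if any(line.strip() for line in body_lines):
            header = "\n".join(f"{'#' * lvl} {path[lvl]}" for lvl in sorted(path))
            chunks.append(f"{header}\n---\n" + "\n".join(body_lines).strip())
        body_lines.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,3})\s+(.*)", line)
        if match:
            flush()
            level = len(match.group(1))
            path[level] = match.group(2).strip()
            for deeper in [lvl for lvl in path if lvl > level]:
                del path[deeper]        # a new h1/h2 invalidates deeper headings
        else:
            body_lines.append(line)
    flush()
    return chunks
```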
I'm not a fan of this technique. I agree the scenario they lay out is a common problem, but the proposed solution feels odd.

Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cause you to lose a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.

You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation-wise it's a weighted sum. During the augmentation step, where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.

Description of "semantic boost" in the Trieve API[1]:

> semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.

[1]: https://docs.trieve.ai/api-reference/chunk/create-or-upsert-chunk-or-chunks
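The weighted-sum part is tiny; here is a sketch of the idea (not Trieve's actual implementation, and it assumes your embeddings are unit-normalised for cosine search):

```python
import numpy as np

def semantic_boost(chunk_vec: np.ndarray, phrase_vec: np.ndarray,
                   distance_factor: float = 0.25) -> np.ndarray:
    # Move the chunk vector part of the way along the straight (L2) line
    # between it and the distance-phrase vector, then re-normalise.
    boosted = chunk_vec + distance_factor * (phrase_vec - chunk_vec)
    return boosted / np.linalg.norm(boosted)
```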
The technique I find most useful is to implement a "linked list" strategy where a chunk has multiple pointers to the entry it is referenced by. This task is done manually, but the diversity of ways you can reference a particular node goes up dramatically.

Another way to look at it: comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance, and others will be farther, due to the perception of the authors of the comments themselves. But if you assign each comment a "parent_id", your access to the post multiplies.

You can see an example of this technique here [1]. I don't attempt to mind-read what the end user will query for; I simply let them tell me, and then index that as a pointer. There are only a finite number of options to represent a given object. But some representations are very, very, very far from the semantic meaning of the core object.

[1] - https://x.com/yourcommonbase/status/1833262865194557505
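A toy sketch of the pointer idea (all names are made up for illustration): user-supplied phrasings get indexed as their own entries, each carrying a parent_id, and retrieval resolves back to the parent.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    id: str
    text: str
    parent_id: str | None = None   # None for the root document itself

documents = {"doc-1": Entry("doc-1", "Original post about contextual retrieval")}

pointers = [
    Entry("ptr-1", "how do I stop my RAG chunks losing context?", parent_id="doc-1"),
    Entry("ptr-2", "that post on adding context to chunks", parent_id="doc-1"),
]

def resolve(hits: list[Entry]) -> list[Entry]:
    # Whatever matched (pointer or document), hand the parent document to the LLM.
    return [documents[h.parent_id] if h.parent_id else h for h in hits]
```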
The statement about just throwing 200k tokens at the model to get the best answer for smaller datasets goes against my experience. I commonly find that as my prompt gets larger, the output becomes less consistent and instruction-following gets worse. Does anyone else experience this, or know of a good way to avoid it? It seems to happen at much less than even 25k tokens.
Interesting. One problem I'm facing is using RAG to retrieve applicable rules instead of knowledge (chunks): only rules that may apply to the context should be injected into it. I haven't done any experiments, but one approach that I think could work would be to train small classifiers to determine whether a specific rule *could* apply. The main LLM would then be tasked with determining whether the rule indeed applies for the current context.

An example: suppose you're using an LLM to play a multi-user dungeon. In the past your character has behaved badly with taxis, so the game has created a rule that says that whenever you try to enter a taxi you're kicked out: "we know who you are, we refuse to have you as a client until you formally apologize to the taxi company director". Upon apologizing, the rule is removed. Note that the director of the taxi company could be another player and be the one who issued the rule in the first place, to be enforced by his NPC fleet of taxis.

I'm wondering how well this could scale (with respect to the number of active rules) and to what extent traditional RAG could be applied. Deciding whether a rule applies seems to be a more abstract and difficult problem than deciding whether a chunk of knowledge is relevant.

In particular, the main problem I have identified that makes it more difficult is the following dependency loop that doesn't appear with knowledge retrieval: you need to retrieve a rule in order to decide whether it applies. Does anyone know how this problem could be solved?
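One possible shape for this, very much a sketch: a cheap per-rule gate does broad recall over the rule's trigger description, and the main LLM then confirms whether the rule actually applies. Here `embed`, `cosine`, and `ask_llm` are injected placeholders for whatever stack you already have, and the rule format is made up:

```python
from typing import Callable, Sequence

Rule = dict  # e.g. {"id": ..., "trigger": ..., "text": ...}

def applicable_rules(context: str,
                     rules: Sequence[Rule],
                     embed: Callable[[str], list[float]],
                     cosine: Callable[[list[float], list[float]], float],
                     ask_llm: Callable[[str], str],
                     threshold: float = 0.3) -> list[Rule]:
    ctx_vec = embed(context)
    confirmed = []
    for rule in rules:
        # Stage 1: cheap gate on the rule's trigger description
        # (this could equally be a small per-rule classifier).
        if cosine(ctx_vec, embed(rule["trigger"])) < threshold:
            continue
        # Stage 2: the main LLM decides whether the rule really applies.
        verdict = ask_llm(
            f"Context:\n{context}\n\nRule trigger: {rule['trigger']}\n"
            "Does this rule apply right now? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            confirmed.append(rule)
    return confirmed
```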
> If your knowledge base is smaller than 200,000 tokens (about 500 pages of material)

I would prefer that Anthropic just release their tokeniser so we don't have to make guesses.
This sounds a lot like how we used to do research, by reading books and writing any interesting quotes on index cards, along with where they came from. I wonder if prompting for that would result in better chunks? It might make it easier to review if you wanted to do it manually.
I wish they had included the datasets they used for the evaluations. As far as I can tell, in Appendix II they include some sample questions, answers, and golden chunks, but they don't give the entire dataset or explicit information on exactly what the datasets are.

Does anyone know if the datasets they used for the evaluation are publicly available, or if they give more information on the datasets than what's in Appendix II?

There are standard publicly available datasets for this type of evaluation, like MTEB (https://github.com/embeddings-benchmark/mteb). I wonder how this technique does on the MTEB dataset.
Even with prompt caching, this adds a huge amount of extra time to your vector database create/update, right? That may be okay for some use cases, but I'm always wary of adding multiple LLM layers to these kinds of applications. It's nice for the cloud LLM providers, of course.

I wonder how it would work if you generated the contexts yourself algorithmically. Depending on how well structured your docs are, this could be quite trivial (e.g. for an HTML doc, insert title > h1 > h2 > chunk).
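A sketch of that algorithmic version for HTML sources (assumes reasonably clean markup and the beautifulsoup4 package; only h1/h2 are tracked to keep it short):

```python
from bs4 import BeautifulSoup

def structural_chunks(html: str) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    headings = {"h1": "", "h2": ""}
    chunks = []
    for tag in soup.find_all(["h1", "h2", "p"]):
        if tag.name in headings:
            headings[tag.name] = tag.get_text(strip=True)
            if tag.name == "h1":
                headings["h2"] = ""          # reset the deeper level
        else:
            text = tag.get_text(" ", strip=True)
            if text:
                prefix = " > ".join(p for p in (title, headings["h1"], headings["h2"]) if p)
                chunks.append(f"{prefix}\n{text}")   # context prefix + paragraph
    return chunks
```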
I just took the time to read through all the source code and docs. Nice ideas. I like to experiment with LLMs running on my local computer, so I will probably convert this example to use the lightweight Python library Rank-BM25 instead of Elasticsearch, and a long-context model running on Ollama. I wouldn't have prompt caching, though.

This example is well written and documented, and easy to understand. Well done.
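That swap can be quite small; a hedged sketch (assumes the rank_bm25 and ollama Python packages, a naive whitespace tokeniser, and a model name that's only an example):

```python
import ollama
from rank_bm25 import BM25Okapi

chunks = ["...chunk one...", "...chunk two...", "...chunk three..."]
bm25 = BM25Okapi([c.lower().split() for c in chunks])   # naive tokenisation

def answer(question: str, k: int = 3) -> str:
    top_chunks = bm25.get_top_n(question.lower().split(), chunks, n=k)
    prompt = ("Answer using only this context:\n"
              + "\n---\n".join(top_chunks)
              + f"\n\nQuestion: {question}")
    reply = ollama.chat(model="llama3.1",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```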
I don't know anything about AI but I've always wished I could just upload a bunch of documents/books and the AI would perform some basic keyword searches to figure out what is relevant, then auto include that in the prompt.
Looking forward to some guidance on "chunking":

"Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance."
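In the meantime, the simplest version of those knobs looks something like this (a word-based sketch only; real pipelines usually split on tokens or sentence boundaries):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    # Fixed-size chunks with a sliding overlap so sentences near a boundary
    # appear in two chunks instead of being cut in half.
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```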
I've been wondering for a while whether exposing Elasticsearch as just another function to call might be interesting. If the LLM can generate the queries itself, it's an easy deployment.
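That's basically tool use with a search tool. A sketch of what the wiring could look like (assumes the elasticsearch Python client, an index named "docs" with a "text" field, and a model API that supports tool calling; the chat loop itself is elided):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Tool definition handed to the model alongside the conversation.
search_tool = {
    "name": "search_documents",
    "description": "Full-text search over the document index; returns matching passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search terms"}},
        "required": ["query"],
    },
}

def run_search(query: str, size: int = 5) -> list[str]:
    resp = es.search(index="docs", query={"match": {"text": query}}, size=size)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

# In the chat loop: pass tools=[search_tool] to the model, and whenever it emits
# a tool-use request for "search_documents", call run_search() and feed the
# results back as the tool result.
```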
I guess this does give some insight: using a more space-efficient language for your codebase will mean more functionality fits in the AI's context window when working with Claude and code.