Some thoughts:

- RAG generally gets you to the prototype stage with interesting, demo-able results quickly. However, if your users turn out to submit queries that the embedder finds hard to vectorize (meaning you don't retrieve the relevant vectorized chunks of source data), it quickly becomes painful.

- It's easy to go overboard with prompt chains that summarize/reduce the fetched results, or pre-processing prompts that help vectorize the query better (see the pain point above). Always invest in a testing framework and sane test data upfront so you can avoid the classic data science "tweak it until the demo looks good" trap.

- Don't ever use LangChain. It's a baroque shitshow of a tool with a cluttered, inconsistent API written in a bad, inefficient coding style.

- Paying for bespoke vector databases is probably snake oil, and besides the weird pricing, it only causes pain in the long run when you want to store more than just your embeddings (looking at you, $70+/month Pinecone). Postgres with pgvector gets you very far until you hit multiple millions of documents, and you get all the benefits of a mature, scalable RDBMS. Keep your embeddings close to the rest of your data (a minimal pgvector sketch follows below).
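A minimal sketch of the "keep embeddings next to your data" point, assuming Postgres with the pgvector extension and psycopg2. The table name, columns, and the 1536-dimension size are illustrative assumptions, not a prescription.

```python
# Sketch only: documents and their embeddings live in the same Postgres
# database, so retrieval is just SQL next to the rest of your data.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id        bigserial PRIMARY KEY,
        body      text NOT NULL,
        embedding vector(1536)   -- match your embedder's output dimension
    );
""")
conn.commit()

def to_vec(embedding: list[float]) -> str:
    # pgvector accepts a '[x,y,z]' text literal cast to ::vector
    return "[" + ",".join(str(x) for x in embedding) + "]"

def insert_document(body: str, embedding: list[float]) -> None:
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES (%s, %s::vector)",
        (body, to_vec(embedding)),
    )
    conn.commit()

def nearest_documents(query_embedding: list[float], k: int = 5) -> list[tuple[int, str]]:
    # <=> is pgvector's cosine-distance operator; because everything is in one
    # database, you can join against any other tables you need
    cur.execute(
        "SELECT id, body FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (to_vec(query_embedding), k),
    )
    return cur.fetchall()
```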
Great explanation. I understand the process a lot better after reading.

I have questions now: it seems like everything hinges on the search step, where domain-specific content is gathered to serve as prompt input for the LLM.

1. In an area like technical docs, if I can't find what I need it's often because I don't know the right terms (e.g. I'm looking for "inline subquery" but they're really called "lateral joins"). Would the search step have any likelihood of finding the right context to feed the LLM here?

2. How much value is added by feeding the search results through the LLM along with the user's prompt, vs. just returning the results directly to the user?

3. Are there good techniques being developed for handling citations in the LLM output? IIRC Google had this in Bard.
ACL had a recent tutorial on the state of the art for this topic: https://acl2023-retrieval-lm.github.io/

My favorite takeaway was that purely fine-tuning your model on your documents (without extra document context at inference time) consistently performs worse than adding retrieved context from a datastore.
Hello, author here. I wrote this guide up after spending a long time trying to wrap my head around LangChain and the associated ecosystem, and I hope others find it useful. Happy to answer any questions or hear any feedback.
OK, so this is one of those things where the query first goes to some kind of lookup system, and you get back something which is fed into the LLM as part of the prompt.

Is the lookup running locally (not on OpenAI)? Can you look at the output of the lookup to see what hopefully relevant info it is throwing into the LLM? Can you use this with some local open-source LLM system?
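A minimal sketch of the flow described above, not any particular library's API. The lookup step is just code you run yourself, so it can be local and you can inspect exactly what it returns before it reaches the LLM; the embedder, index, and model are passed in as plain callables (hypothetical stand-ins).

```python
from typing import Callable

def answer(
    question: str,
    embed: Callable[[str], list[float]],              # local model or an embeddings API
    search: Callable[[list[float], int], list[str]],  # your own index/DB lookup
    call_llm: Callable[[str], str],                   # OpenAI, or a local open-source LLM
) -> str:
    chunks = search(embed(question), 4)

    # Nothing magic here: you can print/log the retrieved chunks to see what
    # is actually being stuffed into the prompt.
    for chunk in chunks:
        print("retrieved:", chunk[:80], "...")

    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(chunks) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt)
```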
RAG is great for pulling some additional knowledge, but if you combine it with fine-tuning (i.e., the LLM 'understands' the domain-specific terminology better) it becomes a lot more effective
Relatedly, to have a useful chatbot you need to track chat history in a way very similar to augmenting with document retrieval, but you may need to generate embeddings and summaries as you go.

A friend of mine is working on an OSS memory system for chat apps that helps store, retrieve, and summarize chat history (and now documents too), built, I believe, on top of LangChain: https://www.getzep.com/
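A rough sketch of the idea only (not Zep's actual API): keep recent turns verbatim, fold older turns into a running summary, and embed every turn so relevant history can be retrieved later like any other document. The `summarize` and `embed` callables are hypothetical stand-ins for whatever model and embedder you use.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ChatMemory:
    summarize: Callable[[str], str]
    embed: Callable[[str], list[float]]
    recent_limit: int = 10
    summary: str = ""
    recent: list[str] = field(default_factory=list)
    turn_vectors: list[tuple[str, list[float]]] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        self.turn_vectors.append((turn, self.embed(turn)))  # searchable later
        if len(self.recent) > self.recent_limit:
            # compress the oldest turn into the running summary instead of dropping it
            oldest = self.recent.pop(0)
            self.summary = self.summarize(self.summary + "\n" + oldest)

    def context_for_prompt(self) -> str:
        return (
            f"Conversation summary:\n{self.summary}\n\n"
            "Recent turns:\n" + "\n".join(self.recent)
        )
```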
For those playing with this: if you attach unique identifiers to your documents in a consistent way, you can prompt the model to cite sources when generating the answer.
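A minimal sketch of the identifier trick: prefix each retrieved chunk with a stable ID and ask the model to cite those IDs. The exact wording and the `[doc-42]` format are just one reasonable choice, not a standard.

```python
def build_cited_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    # `chunks` is a list of (doc_id, text) pairs coming out of retrieval
    sources = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return (
        "Answer the question using the sources below. After each claim, cite "
        "the supporting source ID in square brackets, e.g. [doc-42].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```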
One thing I've wondered about is how best to perform query expansion so that retrieval returns vectors that answer my question, rather than vectors that just look like my question itself.
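One known approach here is the idea behind HyDE (hypothetical document embeddings): ask the LLM to draft a plausible answer first and embed that draft, so the query vector looks like an answer rather than a question. A hedged sketch, with `call_llm` and `embed` as hypothetical stand-ins for your model and embedder:

```python
from typing import Callable

def expanded_query_vector(
    question: str,
    call_llm: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    hypothetical_answer = call_llm(
        "Write a short passage that would answer this question, guessing at "
        f"details if you have to:\n\n{question}"
    )
    # Use the drafted answer's embedding (optionally averaged with the
    # question's own embedding) for the nearest-neighbour search.
    return embed(hypothetical_answer)
```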
Retrieval-augmented generation (we call it "grounded generation" at Vectara) is a great way to build GenAI apps with your data.
This blog post can be useful: https://vectara.com/a-reference-architecture-for-grounded-generation/. The long and short of it is: building RAG applications seems easy at the start but gets complicated as you go from a toy application to a scalable enterprise deployment.
Something I've wondered is why there isn't a language model trained on doing just RAG.

My suspicion is that the language model could be a lot smaller if it's just regurgitating things from the context above.