FWIW, having written a simple RAG system from "scratch" (meaning not using frameworks or API calls), it's not more complicated than doing it this way with LangChain etc.

This post is mostly about plumbing. It's probably the right way to do it if it needs to be scaled, but for learning, it obscures what is essentially simple stuff going on behind the scenes.
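To make that concrete, the core of a from-scratch setup can be as little as this (a rough sketch, not my exact code; the model name and documents are placeholders, and I'm assuming sentence-transformers for the embeddings):

    # Minimal RAG without a framework: embed documents, retrieve by cosine
    # similarity, stuff the hits into a prompt. Corpus and model are placeholders.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = [
        "Ray Tune is a library for hyperparameter tuning at any scale.",
        "Ray Serve is a scalable model serving library.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    def retrieve(query, k=2):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                 # cosine similarity (vectors are normalized)
        top = np.argsort(-scores)[:k]
        return [docs[i] for i in top]

    query = "What is Ray Tune?"
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # `prompt` then goes to whatever LLM you like.

Everything else (chunking, storage, serving) is plumbing on top of that loop.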
What caught my attention in this article is the section named "Cold Start", where questions are generated based on a provided context.
I think it is a good way to cheaply generate a Q&A dataset that can later be used to finetune a model.
But the problem is that some of the generated questions and answers are of poor quality. All of the examples shown have issues:
- "What is the context discussing about?" - which context?
- "The context does not provide information on what Ray Tune is." - Not an answer
- "The context does not provide information on what external library integrations are." - same as before
I could only think of manual review to remove these noisy questions. Any ideas on how to improve this QA generation? I've tried it before, but with paltry results.
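One cheap thing that might help before manual review is a crude keyword filter that drops the obvious refusals, along the lines of this sketch (assuming the pairs come back as plain dicts):

    # Heuristic pre-filter for synthetic Q&A pairs, based on the failure
    # patterns listed above; the `pairs` structure is an assumption.
    BAD_PATTERNS = [
        "does not provide",   # non-answers
        "the context",        # questions/answers that leak the word "context"
    ]

    def keep(pair):
        text = (pair["question"] + " " + pair["answer"]).lower()
        return not any(p in text for p in BAD_PATTERNS)

    pairs = [
        {"question": "What is the context discussing about?", "answer": "..."},
        {"question": "What is Ray Tune?", "answer": "Ray Tune is a scalable hyperparameter tuning library."},
    ]
    clean = [p for p in pairs if keep(p)]   # manual review then runs on the survivors

It wouldn't catch everything, but it would at least remove the "does not provide information" style examples automatically.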
Kudos to the team for a very detailed notebook going into things like pipeline evaluation wrt performance, costs, etc. Even if we ignore the framework-specific bits, it is a great guide to follow when building RAG systems in production.

We have been building RAG systems in production for a few months and have been tinkering with different strategies to get the most performance out of these pipelines. As others have pointed out, a vector database may not be the right strategy for every problem. Similarly, there are things like the lost-in-the-middle problem (https://arxiv.org/abs/2307.03172) that one may have to deal with. We put together our learnings from building and optimizing these pipelines in a post at https://llmstack.ai/blog/retrieval-augmented-generation.

https://github.com/trypromptly/LLMStack is a low-code platform we open-sourced recently that ships these RAG pipelines out of the box with some app templates, if anyone wants to try them out.
While you don't strictly "need" a vector DB to do RAG, as others have pointed out, vector databases excel when you're dealing with natural language, which is ambiguous.

This will be the case when you're exposing an interface to end users that they can submit arbitrary queries to, such as "how do I turn off reverse braking".

By converting the user's query to vectors before sending it to your vector store, you're getting at the user's actual intent behind their words, which can help you retrieve more accurate context to feed to your LLM when asking it to perform a chat completion, for example.

This is also important if you're dealing with proprietary or non-public data that a search engine can't see. Context-specific natural language queries are well suited to vector databases.

We wrote up a guide with examples here: https://www.pinecone.io/learn/retrieval-augmented-generation/

And we've got several example notebooks you can run end to end using our free tier here: https://docs.pinecone.io/page/examples
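For instance, the query-time flow can be as simple as this (a rough sketch, not a definitive recipe; it assumes a 2023-era pinecone-client, an index that already holds your chunk embeddings, and that chunk text was stored under a "text" metadata key):

    # Embed the user's query, search the vector index, and assemble context
    # for the chat completion. Index name, keys, and model are placeholders.
    import pinecone
    from sentence_transformers import SentenceTransformer

    pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")
    index = pinecone.Index("car-manual")          # hypothetical index name

    model = SentenceTransformer("all-MiniLM-L6-v2")
    query = "how do I turn off reverse braking"
    q_vec = model.encode(query).tolist()

    results = index.query(vector=q_vec, top_k=3, include_metadata=True)
    context = "\n".join(m.metadata["text"] for m in results.matches)
    # `context` + the original question then go into the LLM prompt.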
My question is: if I want to use an LLM to help me sift through a large amount of structured data, say, for example, all the logs for a bunch of different applications from a certain cloud environment, each with its own idiosyncrasies and specific formats (many GBs of data), can the RAG pattern be useful here?

Some of my concerns:

1) Is sentence embedding using an off-the-shelf embedding model going to capture the "meaning" of my logs? My answer is "probably not". For example, if a portion of my logs is in this format:

    timestamp_start,ClassName,FunctionName,timestamp_end
Will I be able to get meaningful embeddings that satisfy a query such as "what components in my system exhibited an anomalously high latency lately?" (this is just an example among many different queries I'd have)?

Based on the little I know, it seems to me off-the-shelf embeddings wouldn't be able to match the embedding of my query with the embeddings of the relevant log lines, given the complexity of this task.

2) Is it going to be even feasible (cost/performance-wise) to use embeddings when one has a firehose of data coming through, or is the approach better suited to a mostly static corpus of data (e.g. your typical corporate documentation or product catalog)?

I know that I can achieve something similar with a Code Interpreter-like approach, so in theory I could build a multi-step reasoning agent that, starting from my query and the data, would try to (1) discover the schema and then (2) crunch the data to try to get to my answer, but I don't know how scalable this approach would effectively be.
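For the example format above, the "crunch the data" route might look roughly like this (a sketch only; the column names and file path are assumptions):

    # Answer the latency question directly from the structured logs
    # instead of embedding them.
    import pandas as pd

    cols = ["timestamp_start", "ClassName", "FunctionName", "timestamp_end"]
    df = pd.read_csv("app.log", names=cols)      # hypothetical log file

    df["latency_s"] = (
        pd.to_datetime(df["timestamp_end"]) - pd.to_datetime(df["timestamp_start"])
    ).dt.total_seconds()

    # p99 latency per component: candidates for "anomalously high latency"
    p99 = df.groupby("ClassName")["latency_s"].quantile(0.99).sort_values(ascending=False)
    print(p99.head(10))

So the question is really whether the LLM should be the thing answering the query, or the thing writing code like this against a discovered schema.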
Wow, this was indeed super comprehensive. A few things I noticed:

- In the cold start section, a couple of the synthetic_data responses say "context does not provide info...".

- It's strange that retrieval_score would decrease while quality_score increases at the higher chunk sizes. Could this just be that the retrieved chunk is starting to be larger than the reference?

- GPT-3.5 pricing looks out of date; it's currently $0.0015 per 1K input tokens for the 4K model.

- Interesting that pricing needs to be shown on a log scale. GPT-4 is 46x more expensive than Llama 2 70B for a ~0.3 score increase. Training a simple classifier to route queries seems like a great way to handle this.

- I wonder how stable the quality_score assessment is given the exact same configuration. I guess the score differences between falcon-180b, llama-2-70b, and gpt-3.5 are insignificant?

Is there a similarly comprehensive deep dive into chunking methods anywhere? Especially for queries that require multiple chunks to answer at all; producing more relevant chunks would have a massive impact on response quality, I imagine.
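Something like this toy sketch is what I have in mind for the router classifier (labels for whether the cheap model was good enough would have to come from an evaluation like the notebook's quality_score; the encoder and model names are just placeholders):

    # Route queries: send them to the cheap model unless the classifier
    # predicts its answer won't be good enough.
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    # Toy training data: query text + whether the cheap model scored well on it.
    queries = ["What is Ray Tune?", "Compare PPO and A2C convergence behaviour"]
    cheap_model_was_good = [1, 0]

    X = encoder.encode(queries)
    router = LogisticRegression().fit(X, cheap_model_was_good)

    def route(query):
        ok = router.predict(encoder.encode([query]))[0]
        return "llama-2-70b" if ok else "gpt-4"

Even a rough router like that could claw back a lot of that 46x cost gap on the easy queries.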
Anyscale consistently posts great projects. Very cool to see the cost comparison and quality comparison. Not surprising to see that OSS models are less expensive, but also rated as slightly lower quality than gpt-3.5-turbo.

I do wonder, is there some bias in the quality measures from using GPT-4 to evaluate GPT-4's output? https://www.linkedin.com/feed/update/urn:li:activity:7103398601090863104/
Here is the blog post accompanying the notebook:

https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1