This is super helpful. I'm building a document question-answering service over a custom data corpus (related to Saivism, a sect of Hinduism). The first pass has been to manually chunk the text (based on headings, chapters, etc.), embed the chunks with OpenAI's embedding service, and store the embeddings in Pinecone, all stitched together using LangChain. To answer a question, the question is embedded the same way, searched against the vector store, and the matching documents are provided as context to the LLM along with the question.<p>So far it was really easy to set up the prototype, but the results weren't as great as I had hoped, so I'm excited to see how I could improve it.<p>Edit: wow, I didn't see this before. LangChain implements one of the featured article's suggestions (HyDE) - <a href="https://python.langchain.com/en/latest/modules/chains/index_examples/hyde.html" rel="nofollow">https://python.langchain.com/en/latest/modules/chains/index_...</a>
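For anyone curious, the core of the setup is only a few lines with the current (2023-era) LangChain API. A rough sketch — the index name, chunk texts, API keys, and sample question are all placeholders for my actual corpus:

  import pinecone
  from langchain.embeddings.openai import OpenAIEmbeddings
  from langchain.vectorstores import Pinecone
  from langchain.chat_models import ChatOpenAI
  from langchain.chains import RetrievalQA

  pinecone.init(api_key="YOUR_KEY", environment="YOUR_ENV")

  # Stand-ins for the manually split sections (headings, chapters, etc.)
  chunks = ["...chapter 1 text...", "...chapter 2 text..."]

  # Embed each chunk and upsert into a Pinecone index (placeholder name)
  vectorstore = Pinecone.from_texts(chunks, OpenAIEmbeddings(), index_name="saivism-corpus")

  # At query time: embed the question, fetch the nearest chunks,
  # and stuff them into the prompt as context
  qa = RetrievalQA.from_chain_type(
      llm=ChatOpenAI(temperature=0),
      chain_type="stuff",
      retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
  )
  print(qa.run("What is the significance of Nandi in Saiva worship?"))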
This is one of the areas of LLMs that I find most interesting. So far, I've found simple question-answering over vectorstores to be a lackluster experience. In particular, the more information you embed and stick into the vectorstore, the less useful the system becomes, as you are less likely to get the information you're looking for (especially if users don't understand that their queries need to look like the docs they want to ask about).<p>I haven't had a chance to try out hypothetical document embeddings yet, but I expect they only provide a marginal improvement (especially if QAing over proprietary data or information).<p>I'd love to see any other interesting, more up-to-date resources anyone has found on this topic. I found this recent paper interesting: <a href="https://arxiv.org/abs/2304.11062" rel="nofollow">https://arxiv.org/abs/2304.11062</a>
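I haven't run it myself, but per the LangChain docs linked in the parent, the HyDE setup is only a few lines. The gist: the LLM first writes a hypothetical answer document, and that answer (not the raw question) is what gets embedded, so the query vector looks more like the docs in the store:

  from langchain.llms import OpenAI
  from langchain.embeddings import OpenAIEmbeddings
  from langchain.chains import HypotheticalDocumentEmbedder

  base_embeddings = OpenAIEmbeddings()
  llm = OpenAI()

  # Wraps the base embedder: generate a hypothetical answer doc with the
  # "web_search" prompt, then embed that doc instead of the question
  embeddings = HypotheticalDocumentEmbedder.from_llm(llm, base_embeddings, "web_search")
  query_vector = embeddings.embed_query("Where is the Taj Mahal?")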
This document seems to have been written before the Toolformer paper[0], which fine-tunes the model to use tools (e.g. search) to retrieve information.<p>[0]: <a href="https://arxiv.org/abs/2302.04761" rel="nofollow">https://arxiv.org/abs/2302.04761</a>
A few other helpful options recently added to LangChain:<p>1. Extraction for query filters - <a href="https://twitter.com/hwchase17/status/1651617956881924096?s=46&t=gkyxL9FAhSE-DiMAkwTkcg" rel="nofollow">https://twitter.com/hwchase17/status/1651617956881924096?s=4...</a><p>2. Contextual compression to eke more out of prompt stuffing (rough sketch after this list) - <a href="https://twitter.com/hwchase17/status/1649428295467905025?s=46&t=gkyxL9FAhSE-DiMAkwTkcg" rel="nofollow">https://twitter.com/hwchase17/status/1649428295467905025?s=4...</a><p>And then there are existing utility chains for map-reduce, re-ranking, etc., for more ways to apply LLM completions over large documents and/or large sets of documents:
3. <a href="https://m.youtube.com/watch?v=f9_BWhCI4Zo">https://m.youtube.com/watch?v=f9_BWhCI4Zo</a>
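For (2), contextual compression amounts to wrapping whatever retriever you already have: an LLM extracts only the query-relevant parts of each retrieved doc before they get stuffed into the prompt. A minimal sketch, assuming `base_retriever` is your existing vectorstore retriever:

  from langchain.llms import OpenAI
  from langchain.retrievers import ContextualCompressionRetriever
  from langchain.retrievers.document_compressors import LLMChainExtractor

  # The compressor LLM pulls out only the passages relevant to the query
  compressor = LLMChainExtractor.from_llm(OpenAI(temperature=0))

  compression_retriever = ContextualCompressionRetriever(
      base_compressor=compressor,
      base_retriever=base_retriever,  # assumed: whatever retriever you already use
  )
  docs = compression_retriever.get_relevant_documents("your question here")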
Sentence embeddings have been great for improving semantic search, but I am still struggling to find relevant documents for numerical values, for questions like "which people were born in 1992" or "people with at least 4 children". One thing I can do is pre-process the data by transforming the date of birth into boomers/zoomers/millennials and the like, but this does not help on the question side if people don't know what to ask.
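One thing that might help on the question side (I haven't tried it on this problem yet): keep the raw numbers as structured metadata and let an LLM translate the question into a filter, e.g. LangChain's self-query retriever, which is what the query-filter extraction mentioned upthread does. It needs the `lark` package and a vectorstore with self-query support (Pinecone is one). The field names below are made up for illustration:

  from langchain.llms import OpenAI
  from langchain.retrievers.self_query.base import SelfQueryRetriever
  from langchain.chains.query_constructor.base import AttributeInfo

  # Hypothetical numeric metadata attached to each document at indexing time
  metadata_field_info = [
      AttributeInfo(name="birth_year", description="Year the person was born", type="integer"),
      AttributeInfo(name="num_children", description="Number of children the person has", type="integer"),
  ]

  # `vectorstore` is assumed to be an existing store with metadata-filter support
  retriever = SelfQueryRetriever.from_llm(
      OpenAI(temperature=0),
      vectorstore,
      "Short biographies of people",
      metadata_field_info,
  )
  # "born in 1992" becomes an exact filter on birth_year instead of a fuzzy semantic match
  docs = retriever.get_relevant_documents("people born in 1992")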