I'm looking to build an LLM-based chatbot that can answer questions using a set of internal PDF documents. Has anyone worked on a similar use case with good success?
What approach and LLM stack did you use to solve this - RAG (Retrieval-Augmented Generation), fine-tuning, or embedding-based search?
RAG and embedding-based search are the same thing AFAIK.<p>My approach is to stuff as many documents as possible directly into the context. The context windows of frontier models are large enough for my use case of ~20-40 documents. Context windows are 128K tokens for gpt-4o, around 200K for o1/o3, and 1M for Gemini.<p>When stuffing all of them into one query isn't possible, split the documents across multiple queries and aggregate the answers.<p>I've tried RAG, but matching query embeddings to chunk embeddings isn't that straightforward. I noticed that relevant content was missed even with my modest number of documents. Semantic matching using query embeddings is one level above dumb keyword matching but one level below direct queries to LLMs.<p>Direct LLM queries seem to perform the best, especially when some intermediate understanding is required (like "Based on these documents, infer the industries where X technique may be useful"). That's not possible with simple embedding search unless some of the documents specifically use the umbrella word "industry" or its close synonyms.<p>Embedding search can probably be improved, for example by generating a synthetic answer and matching that answer's embedding to chunk embeddings. But I haven't tried such techniques.
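For anyone who wants to try the split-and-aggregate idea from the comment above, here is a minimal sketch. It is not the commenter's actual code. The names `estimate_tokens`, `pack_documents`, `answer_over_documents`, and the `ask_llm` callable are all hypothetical, the 4-characters-per-token estimate is a rough assumption, and `ask_llm` stands in for whatever client call you use to send a prompt to a model.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token for English text.
    # Swap in a real tokenizer if you need accurate counts.
    return max(1, len(text) // 4)


def pack_documents(docs: List[str], budget: int) -> List[List[str]]:
    """Greedily group documents so each group's estimated size fits the budget.

    A single document larger than the budget still gets its own group;
    this sketch does not split individual documents.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches


def answer_over_documents(
    question: str,
    docs: List[str],
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, answer text out
    budget: int = 100_000,
) -> str:
    """Ask the question once per batch, then ask the model to merge the answers."""
    partials = []
    for batch in pack_documents(docs, budget):
        prompt = "\n\n---\n\n".join(batch) + f"\n\nQuestion: {question}"
        partials.append(ask_llm(prompt))
    if len(partials) == 1:
        # Everything fit in one query, so no aggregation step is needed.
        return partials[0]
    combine = (
        "Partial answers:\n"
        + "\n".join(f"- {p}" for p in partials)
        + f"\n\nCombine these into one answer to: {question}"
    )
    return ask_llm(combine)
```

The budget should sit well below the model's advertised context window to leave room for the question and the response. The final merge call costs one extra request, but it lets the model reconcile partial answers instead of you concatenating them by hand.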
Hello, I found that Aspose has released an LLM plugin:<p><a href="https://products.aspose.net/pdf/chat-gpt/" rel="nofollow">https://products.aspose.net/pdf/chat-gpt/</a><p>At a glance, it supports some advanced features:<p>Automatic detection of multiple languages.
Batching requests to reduce LLM API call frequency and lower operational costs.
There is one that uses LangChain + Pydantic + LLMWhisperer: <a href="https://unstract.com/blog/comparing-approaches-for-using-llms-for-structured-data-extraction-from-pdfs/" rel="nofollow">https://unstract.com/blog/comparing-approaches-for-using-llm...</a>
Microsoft Copilot does this out of the box.<p>Just upload your documents to a OneDrive, SharePoint, or Teams site that you have access to and start asking questions.