We tried something similar and found much better results with o1 pro than o3 mini. RAG seems to require a level of world knowledge that the mini models don’t have.

This comes at the cost of significantly higher latency and cost. But for us, answer quality is a much higher priority.
I found the parts discussing the current limitations of LLMs' understanding of tools interesting: despite apparent reasoning abilities, the model didn't seem to have an intuitive sense of when to use the specific search tools.

I wonder whether that step would benefit from a fine-tuned LLM module, or even from providing a set of examples in the prompt of when to use which tool.
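For the in-prompt option, here's a minimal sketch of few-shot tool routing. The tool names, the routing prompt, and the model are illustrative assumptions, not anything from the article; the client is assumed to be an OpenAI-style chat client.

```python
# Sketch: steering tool selection with in-prompt examples instead of fine-tuning.
# Tool names (keyword_search, vector_search, sql_lookup) are hypothetical placeholders.

TOOL_ROUTING_PROMPT = """You have three search tools. Pick exactly one per query.

Examples of correct tool choices:
- "error code 0x80070057 meaning"        -> keyword_search (exact identifiers, error codes)
- "how do I roll back a failed deploy?"  -> vector_search  (conceptual / paraphrased questions)
- "orders over $500 in March 2024"       -> sql_lookup     (structured filters, aggregations)

Query: {query}
Answer with only the tool name."""


def choose_tool(client, query: str) -> str:
    """Ask the model to route the query; `client` is an OpenAI-compatible chat client."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model
        messages=[{"role": "user", "content": TOOL_ROUTING_PROMPT.format(query=query)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

A fine-tuned router would replace the examples with training data, but the in-prompt version is much cheaper to iterate on.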
Am I correct in reading that the RAG pipeline runs in real time in response to a user query?

If so, I would suggest running it ahead of time and generating possible questions from the LLM based on the context of each semantically split chunk.

That way you only need to compare embeddings at query time, and the results are already pre-sorted and ranked.

The trick, of course, is chunking correctly and generating the right questions, but in both cases I would look to the LLM to do that.

Happy to share some tips on semantically splitting documents with the LLM at really low token usage if you're interested.
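A rough sketch of that precompute pass, assuming an OpenAI-style client for both question generation and embeddings; the prompt, model names, and in-memory index are placeholders, not specific recommendations.

```python
# Offline: for each chunk, have the LLM propose questions the chunk answers,
# embed those questions, and store the vectors alongside the chunk id.
# Online: embed only the user query and do nearest-neighbor over stored questions.
import numpy as np

QUESTION_PROMPT = (
    "List 3 questions a user might ask that this passage answers, one per line:\n\n{chunk}"
)


def build_question_index(client, chunks: list[str]):
    index = []  # list of (question_vector, chunk_id)
    for chunk_id, chunk in enumerate(chunks):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            messages=[{"role": "user", "content": QUESTION_PROMPT.format(chunk=chunk)}],
        )
        questions = [
            line.strip("- ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()
        ]
        emb = client.embeddings.create(model="text-embedding-3-small", input=questions)
        for item in emb.data:
            index.append((np.array(item.embedding), chunk_id))
    return index


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def retrieve(client, index, query: str, k: int = 5):
    """Embed the query and return the top-k distinct chunk ids by question similarity."""
    q_emb = client.embeddings.create(model="text-embedding-3-small", input=[query])
    q = np.array(q_emb.data[0].embedding)
    scored = sorted(index, key=lambda pair: -cosine(pair[0], q))
    seen, out = set(), []
    for _, chunk_id in scored:
        if chunk_id not in seen:
            seen.add(chunk_id)
            out.append(chunk_id)
        if len(out) == k:
            break
    return out
```

The expensive LLM calls all happen at index time; the query path is a single embedding call plus a vector comparison.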
When aggregating data from multiple systems, how do you handle searching only against data chunks that the user is authorized to view? And what happens when those permissions change?
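For what it's worth, one common pattern (illustrative only, not necessarily what the authors do) is to store allowed group IDs as metadata on each chunk and filter retrieval by the querying user's groups; permission changes then only require re-syncing that metadata, not re-embedding.

```python
# Illustrative pattern: keep an ACL field on each chunk's metadata and filter
# retrieval by the user's groups. ACL changes only touch the metadata sync.
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: set[str] = field(default_factory=set)  # synced from the source system


def authorized_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Pre-filter candidates before (or as a metadata filter inside) vector search."""
    return [c for c in chunks if c.allowed_groups & user_groups]


def sync_permissions(chunk: Chunk, new_groups: set[str]) -> None:
    """Called when the source system reports an ACL change; no re-embedding needed."""
    chunk.allowed_groups = set(new_groups)
```

Most hosted vector stores expose this as a metadata filter on the search call, so the filtering can happen inside the index rather than in application code.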