We're sharing some experiments in designing RAG systems with the open-source PaperQA2 system (https://github.com/Future-House/paper-qa). PaperQA2's design is interesting because it isn't constrained by cost, so it uses expensive operations like agentic tool calling and LLM-based re-ranking and contextual summarization on every query.

Even though the costs are higher, we find the RAG accuracy gains (on question-answering tasks) are worth it. Including LLM chunk re-ranking and contextual summaries in your RAG flow also makes the system robust to changes in chunk size, parsing oddities, and embedding-model shortcomings. It's one of the largest drivers of performance we could find.
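
To make the pattern concrete, here's a minimal sketch in Python. This is not PaperQA2's actual API; the function name `rerank_and_summarize`, the prompt wording, and the "Score: N" output format are illustrative assumptions. The idea is that, after retrieval, an LLM both summarizes each chunk in the context of the query and assigns it a relevance score, and only the top-scoring summaries flow into the final answer prompt:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ScoredSummary:
        score: int    # LLM-judged relevance, 0-10
        summary: str  # summary of the chunk, written w.r.t. the query

    def rerank_and_summarize(
        query: str,
        chunks: list[str],
        llm: Callable[[str], str],  # any text-in/text-out LLM client
        keep: int = 5,
    ) -> list[ScoredSummary]:
        """Score each retrieved chunk's relevance to the query, summarize
        it in context, and keep only the top-scoring summaries."""
        results = []
        for chunk in chunks:
            prompt = (
                f"Question: {query}\n\n"
                f"Excerpt:\n{chunk}\n\n"
                "Summarize only the parts of the excerpt relevant to the "
                "question. On the last line, write 'Score: N' where N is "
                "a 0-10 relevance rating."
            )
            lines = llm(prompt).strip().splitlines() or ["Score: 0"]
            *summary_lines, score_line = lines
            try:
                score = int(score_line.split(":")[-1].strip())
            except ValueError:
                score = 0  # unparseable rating: treat as irrelevant
            results.append(ScoredSummary(score, "\n".join(summary_lines)))
        # Re-rank by LLM-judged relevance instead of embedding similarity.
        results.sort(key=lambda r: r.score, reverse=True)
        return results[:keep]

This is also why the approach is robust to chunking and parsing problems: the re-ranking LLM reads the actual chunk text, so a badly split or noisily parsed chunk is still judged on its content rather than on embedding proximity alone.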