Hi everyone,<p>I’m delving into optimizing RAG ingest pipelines for textual documents such as PDFs, Office documents (Word, PowerPoint, etc.), HTML files, emails, and plain text. My aim is to benchmark file chunking strategies across these formats and evaluate their impact on the accuracy of the RAG process itself. For this, I need a dataset that includes:<p>- Textual Documents: a diverse set of PDFs, Office files, HTML documents, emails, and plain-text files to test chunking strategies on.<p>- Associated Questions: a set of questions or queries tailored to the content of these documents, to assess how well the RAG process retrieves and generates accurate information from the chunks.<p>- Evaluation Metrics or Ground Truth: ideally, the dataset would come with benchmark answers or ground truth for these questions, allowing a clear assessment of the RAG pipeline's accuracy and performance.<p>If anyone has come across datasets fitting this description, or has experience creating or using similar datasets for RAG accuracy evaluation, I’d greatly appreciate your insights. Recommendations for tools, frameworks, or methodologies for conducting these evaluations would also be incredibly valuable (I've included a rough sketch of how I plan to score each chunking strategy below).<p>Thanks in advance for any help or direction you can provide!
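<p>For context, here is roughly how I'm thinking of scoring each chunking strategy. This is only a minimal Python sketch under my own assumptions: it presumes I already have (document text, question, answer span) triples, and the two chunkers, the toy word-overlap retriever, and the hit-rate metric are placeholders I'd swap for a real embedding model and a proper answer-correctness judge.<p>

    from typing import Callable

    def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into fixed-size character windows with overlap."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def paragraph_chunks(text: str) -> list[str]:
        """Split on blank lines, keeping each paragraph as one chunk."""
        return [p.strip() for p in text.split("\n\n") if p.strip()]

    def retrieve(chunks: list[str], query: str, k: int = 3) -> list[str]:
        """Toy retriever: rank chunks by word overlap with the query."""
        q_words = set(query.lower().split())
        ranked = sorted(chunks,
                        key=lambda c: len(q_words & set(c.lower().split())),
                        reverse=True)
        return ranked[:k]

    def hit_rate(chunker: Callable[[str], list[str]],
                 samples: list[tuple[str, str, str]], k: int = 3) -> float:
        """Fraction of questions whose ground-truth answer span appears in the top-k chunks."""
        hits = 0
        for doc_text, question, answer_span in samples:
            top = retrieve(chunker(doc_text), question, k)
            if any(answer_span.lower() in chunk.lower() for chunk in top):
                hits += 1
        return hits / len(samples) if samples else 0.0

    if __name__ == "__main__":
        # Hypothetical triples extracted from a toy corpus, purely illustrative.
        samples = [
            ("Invoices are due within 30 days.\n\nLate payments incur a 2% fee.",
             "What happens if I pay late?", "2% fee"),
        ]
        for name, chunker in [("fixed-500", fixed_size_chunks),
                              ("paragraph", paragraph_chunks)]:
            print(f"{name}: hit-rate@3 = {hit_rate(chunker, samples):.2f}")

<p>The idea is just to hold the retriever and the question set fixed and vary only the chunker, so any difference in hit rate (or answer correctness, once a generator is added) can be attributed to the chunking strategy. Happy to hear if there's a more standard way to set this up.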