An evaluation dataset designed to test the capabilities of Retrieval-Augmented Generation (RAG) systems. A paper with details and experiments is available on arXiv: <a href="https://arxiv.org/abs/2409.12941" rel="nofollow">https://arxiv.org/abs/2409.12941</a>.
<p>Dataset Overview
824 challenging multi-hop questions requiring information from 2-15 Wikipedia articles
Questions span diverse topics such as history, sports, science, animals, and health
Each question is labeled with reasoning types: numerical, tabular, multiple constraints, temporal, and post-processing
Gold answers and relevant Wikipedia articles are provided for each question
<p>Key Features
Tests end-to-end RAG capabilities in a unified framework
Requires integration of information from multiple sources
Incorporates complex reasoning and temporal disambiguation
Designed to be challenging for state-of-the-art language models
<p>Usage
This dataset can be used to:
<p>Evaluate RAG system performance
Benchmark language model factuality and reasoning
Develop and test multi-hop retrieval strategies
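<p>As a rough illustration of the first use case, the sketch below scores a system's answers against gold answers and breaks accuracy down by reasoning type. It rests on several assumptions: the field names `question`, `answer`, and `reasoning_types` are hypothetical and may not match the dataset's actual columns, the two records are placeholders rather than real dataset entries, and normalized exact match is a simple stand-in metric, not necessarily the scoring method used in the paper.

```python
import re
import string
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def score(records, predict):
    """Return exact-match accuracy overall and per reasoning type.

    `records` is an iterable of dicts with the (hypothetical) keys
    "question", "answer", and "reasoning_types"; `predict` is any
    callable that maps a question string to an answer string.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        correct = normalize(predict(rec["question"])) == normalize(rec["answer"])
        # Count the question once overall and once per reasoning label.
        for label in ["overall", *rec["reasoning_types"]]:
            totals[label] += 1
            hits[label] += int(correct)
    return {label: hits[label] / totals[label] for label in totals}


# Placeholder records for illustration only; these are NOT from the dataset.
records = [
    {
        "question": "placeholder question 1",
        "answer": "The Eiffel Tower",
        "reasoning_types": ["numerical", "temporal"],
    },
    {
        "question": "placeholder question 2",
        "answer": "42",
        "reasoning_types": ["tabular"],
    },
]


def toy_system(question: str) -> str:
    """Stand-in for a real RAG pipeline (retrieve, then generate)."""
    canned = {"placeholder question 1": "eiffel tower"}
    return canned.get(question, "unknown")


report = score(records, toy_system)
for label, accuracy in sorted(report.items()):
    print(f"{label}: {accuracy:.2f}")
```

To evaluate a real system, `toy_system` would be replaced by a function that calls the RAG pipeline, and `records` by rows of the dataset mapped onto the assumed keys. Exact match is strict for free-form answers, so a more lenient comparison may be a better fit in practice.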