Hi, I'm the maintainer of the Epoch AI Benchmarking Hub.

We're building a transparent public dataset of AI model performance.

We log and publish every prompt and response -- not just aggregate scores. We even store and display the JSON bodies of every API call. Example: https://logs.epoch.ai/inspect-viewer/484131e0/viewer.html?log_file=https%3A%2F%2Flogs.epoch.ai%2Finspect_ai_logs%2FXPHDbKVUCPNCs5NoVWU8S3.eval

Each evaluation is linked to detailed information we collect about the model, including its release date, the organization behind it, and in some cases our estimate of the compute used to train it.

At the moment, the database features results from two benchmarks:

- GPQA Diamond: a higher-quality, challenging subset of the GPQA benchmark, which tests models' ability to answer PhD-level multiple-choice questions about chemistry, physics, and biology.

- MATH Level 5: a subset of the hardest questions from the MATH benchmark, a dataset of high-school-level competition math problems.

We plan to rapidly expand our benchmark suite to build a thorough picture of AI progress, adding benchmarks such as FrontierMath, SWE-Bench-Verified, and SciCodeBench.

Announcement post: https://epoch.ai/blog/benchmarking-hub-update
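
If you'd rather poke at the logs programmatically than through the viewer: the example above is served as an Inspect AI .eval log file (the viewer is Inspect's log viewer), so something like the following minimal sketch should work. It assumes the inspect_ai Python package (pip install inspect-ai) plus requests, and uses the example log URL from above; the attribute names follow Inspect's EvalLog schema as I understand it, so treat this as a starting point rather than a reference.

    # Sketch: download one published .eval log and read it with inspect_ai.
    import tempfile

    import requests
    from inspect_ai.log import read_eval_log

    # The raw log behind the viewer example above.
    LOG_URL = "https://logs.epoch.ai/inspect_ai_logs/XPHDbKVUCPNCs5NoVWU8S3.eval"

    # Fetch the log and stash it locally so read_eval_log can open it.
    resp = requests.get(LOG_URL, timeout=60)
    resp.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".eval", delete=False) as f:
        f.write(resp.content)
        log_path = f.name

    log = read_eval_log(log_path)

    print(log.eval.task)   # benchmark/task that was run
    print(log.eval.model)  # model that was evaluated
    print(log.status)      # "success" for a completed run

    # Each sample carries the full prompt/response transcript, not just a score.
    sample = log.samples[0]
    print(sample.input)              # the prompt (string or chat messages)
    print(sample.output.completion)  # the model's final answer text

The same log object also exposes per-sample scores and the recorded model calls, which is where the stored JSON request/response bodies live.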