Open database of AI benchmark results with raw evaluation logs

1 point by tadamcz, 3 months ago

1 comment

tadamcz, 3 months ago
Hi, I'm the maintainer of the Epoch AI Benchmarking Hub.

We're building a transparent public dataset of AI model performance.

We log and publish every prompt and response, not just aggregate scores. We even store and display the JSON bodies for every API call. Example: https://logs.epoch.ai/inspect-viewer/484131e0/viewer.html?log_file=https%3A%2F%2Flogs.epoch.ai%2Finspect_ai_logs%2FXPHDbKVUCPNCs5NoVWU8S3.eval

Each evaluation is linked to detailed information we collect about the model, including its release date, the organization behind it, and in some cases our estimate of the amount of compute used to train it.

At the moment, the database features results from two benchmarks:

- GPQA Diamond: a higher-quality, challenging subset of the GPQA benchmark, which tests models' ability to answer PhD-level multiple-choice questions about chemistry, physics, and biology.

- MATH Level 5: a subset of the hardest questions from the MATH benchmark, a dataset of high-school-level competition math problems.

We plan to rapidly expand our suite of benchmarks to create a thorough picture of AI progress by adding benchmarks such as FrontierMath, SWE-Bench-Verified, and SciCodeBench.

Announcement post: https://epoch.ai/blog/benchmarking-hub-update
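
For readers who want to poke at the raw data, here is a minimal Python sketch (not from the post) that downloads the example eval log linked above for local inspection. The log URL is taken from the comment; the use of the requests library and the local file name are assumptions for illustration.

  # Minimal sketch: download the example eval log linked in the comment.
  # The log URL comes from the post; everything else here is an assumption.
  import requests

  LOG_URL = ("https://logs.epoch.ai/inspect_ai_logs/"
             "XPHDbKVUCPNCs5NoVWU8S3.eval")

  def download_log(url: str, dest: str = "example.eval") -> str:
      """Fetch the published eval log and write it to disk."""
      resp = requests.get(url, timeout=30)
      resp.raise_for_status()
      with open(dest, "wb") as f:
          f.write(resp.content)
      return dest

  if __name__ == "__main__":
      path = download_log(LOG_URL)
      print(f"Saved eval log to {path}")

The downloaded .eval file can then be opened in the hosted Inspect log viewer that the example link points to, or explored with whatever local tooling you prefer.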