
Open database of AI benchmark results with raw evaluation logs

1 point by tadamcz 4 months ago

1 comment

tadamcz 4 months ago
Hi, I'm the maintainer of the Epoch AI Benchmarking Hub.

We're building a transparent public dataset of AI model performance.

We log and publish every prompt and response -- not just aggregate scores. We even store and display the JSON bodies for every API call. Example: https://logs.epoch.ai/inspect-viewer/484131e0/viewer.html?log_file=https%3A%2F%2Flogs.epoch.ai%2Finspect_ai_logs%2FXPHDbKVUCPNCs5NoVWU8S3.eval

Each evaluation is linked to detailed information we collect about the model, including its release date, the organization behind it, and in some cases our estimate of the amount of compute used to train the model.

At the moment, the database features results from two benchmarks:

- GPQA Diamond: a higher-quality, challenging subset of the GPQA benchmark, which tests models' ability to answer PhD-level multiple-choice questions about chemistry, physics, and biology.

- MATH Level 5: a subset of the hardest questions from the MATH benchmark, a dataset of high-school-level competition math problems.

We plan to rapidly expand our suite of benchmarks to create a thorough picture of AI progress, by adding benchmarks such as FrontierMath, SWE-Bench-Verified, and SciCodeBench.

Announcement post: https://epoch.ai/blog/benchmarking-hub-update
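Since the viewer link above points at an Inspect AI .eval file (the log_file parameter decodes to https://logs.epoch.ai/inspect_ai_logs/XPHDbKVUCPNCs5NoVWU8S3.eval), the raw logs should be explorable programmatically as well. Below is a minimal sketch, assuming that URL serves a standard Inspect AI log readable with inspect_ai's read_eval_log; the exact layout of logs.epoch.ai is an assumption, not something stated in the post.

    # Minimal sketch: download one published .eval log and look at its samples.
    # Assumes the decoded log URL above and that the file is a standard Inspect AI
    # eval log readable with inspect_ai.log.read_eval_log -- neither is confirmed
    # by Epoch AI; treat this as illustrative only.
    import tempfile
    from pathlib import Path

    import requests
    from inspect_ai.log import read_eval_log

    LOG_URL = "https://logs.epoch.ai/inspect_ai_logs/XPHDbKVUCPNCs5NoVWU8S3.eval"

    # Download the raw .eval file to a temporary location.
    resp = requests.get(LOG_URL, timeout=60)
    resp.raise_for_status()

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "log.eval"
        path.write_bytes(resp.content)

        log = read_eval_log(str(path))

        # Run-level metadata and aggregate scores.
        print(log.eval.model)      # model identifier for this evaluation
        print(log.results.scores)  # aggregate metrics (e.g. accuracy)

        # Per-sample records: every prompt and response is preserved, not just scores.
        for sample in (log.samples or [])[:3]:
            print(sample.id, sample.input, sample.scores)

The point of the sketch is the claim in the comment: individual prompts, responses, and scores are in the log itself, so aggregate numbers can be re-derived or audited rather than taken on trust.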