FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI

185 points by sshroot, 7 months ago

6 comments

agucova, 7 months ago

For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve 2% of problems in the set (which is kept private to prevent contamination).

They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):

> "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…"

Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.

[1]: https://manifold.markets/MatthewBarnett/will-an-ai-achieve-85-performance-o?play=true
bravura, 7 months ago

Regarding keeping the test set private to avoid contamination, the comments about leakage are spot on. The real test set should always be *the future*.

We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.

The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: when ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.

This work has it right: https://ar5iv.labs.arxiv.org/html//2402.00861
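For readers unfamiliar with the metric being proposed above, here is a minimal sketch of computing bits per byte (the log of per-byte perplexity) on a held-out document with a causal language model. This is only an illustration of the idea, not code from the commenter or the paper; it assumes the Hugging Face transformers API, and "gpt2" is just a placeholder model name.

```python
# Minimal sketch: bits-per-byte of a held-out document under a causal LM.
# Assumes the Hugging Face transformers API; "gpt2" is a placeholder model.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy
        # (in nats) over the n_tokens - 1 next-token predictions.
        loss = model(input_ids=ids, labels=ids).loss.item()

    total_nats = loss * (ids.shape[1] - 1)
    n_bytes = len(text.encode("utf-8"))
    return total_nats / math.log(2) / n_bytes

# The per-byte compression ratio is then bits_per_byte(doc) / 8: lower means
# the model was less surprised by text written after its knowledge cutoff.
```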
westurner, 7 months ago

ScholarlyArticle: "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI" (2024) https://arxiv.org/abs/2411.04872 .. https://epochai.org/frontiermath/the-benchmark :

> [Not even 2%]

> Abstract: *We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.*
benchmarkist, 7 months ago

Very cool. It'll be nice to have a benchmark that can be used to validate abstract reasoning capabilities, because the hype is really starting to get out of hand.
MichaelRazum, 7 months ago

How do they solve the 2%? That's the real question. If those problems were truly unseen, even 2% might already be very impressive.
Davidzheng, 7 months ago

Not very impressed by the problems they displayed, though I guess there should be some good problems in the set, given the comments. It's not that I find them super easy; rather, they seem random, not especially well-posed, and extremely artificial -- in the sense that they don't appear to be of particular mathematical interest (or at least the mathematical content is deliberately hidden for testing purposes) but are constructed according to some odd criteria. Would be happy to hear the well-known mathematicians elaborate on their comments.