For some context on why this is important: this benchmark was designed to be extremely challenging for LLMs, with problems requiring several hours or days of work by expert mathematicians. Currently, LLMs solve under 2% of the problems in the set (which is kept private to prevent contamination).<p>They even provide a quote from Terence Tao, who helped create the benchmark (alongside other Fields medalists and IMO question writers):<p>> “These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…”<p>Surprisingly, prediction markets [1] are putting 62% on AI achieving > 85% performance on the benchmark before 2028.<p>[1]: <a href="https://manifold.markets/MatthewBarnett/will-an-ai-achieve-85-performance-o?play=true" rel="nofollow">https://manifold.markets/MatthewBarnett/will-an-ai-achieve-8...</a>
Regarding keeping the test set private to avoid contamination, the comments about leakage are spot on. The real test set should always be <i>the future</i>.<p>We should evaluate LLMs on text from beyond their knowledge cutoff date, by computing their per-byte perplexity or per-byte compression ratio. There's a deep theoretical connection between compression and learning.<p>The intuition here is that being able to predict the future of science (or any topic, really) is indicative of true understanding. Slightly more formally: When ICLR 2025 announces and publishes the accepted papers, Yoshua Bengio is less surprised/perplexed by what's new than a fresh PhD student. And Terence Tao is less surprised/perplexed by what will be proven in math in the next 10 years than a graduate student in a related field.<p>This work has it right: <a href="https://ar5iv.labs.arxiv.org/html//2402.00861" rel="nofollow">https://ar5iv.labs.arxiv.org/html//2402.00861</a>
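To make the per-byte idea concrete, here's a minimal sketch in Python of scoring post-cutoff text by bits per byte under a language model (assumptions: a Hugging Face causal LM, "gpt2" as a stand-in model name, and text short enough to fit in one context window; longer documents would need a sliding window):<p>
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # assumption: stand-in for any causal LM with a known training cutoff
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    def bits_per_byte(text: str) -> float:
        # Cross-entropy of the text under the model, normalized by its UTF-8 length.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # loss = mean NLL per predicted token, in nats
        total_nats = out.loss.item() * (ids.shape[1] - 1)  # undo the mean
        return total_nats / math.log(2) / len(text.encode("utf-8"))

    # Lower is better: a model that truly "understands" a field should be less
    # surprised by papers published after its training cutoff.
    print(bits_per_byte("Paste text published after the model's cutoff here."))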
ScholarlyArticle: "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI" (2024) <a href="https://arxiv.org/abs/2411.04872" rel="nofollow">https://arxiv.org/abs/2411.04872</a> ..
<a href="https://epochai.org/frontiermath/the-benchmark" rel="nofollow">https://epochai.org/frontiermath/the-benchmark</a> :<p>> [Not even 2%]<p>> Abstract: <i>We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.</i>
Very cool. It'll be nice to have a benchmark that can validate abstract reasoning capabilities, because the hype is really starting to get out of hand.
Not very impressed by the problems they displayed, though I assume there are some good problems in the set given the comments from the well-known mathematicians. It's not that I find them super easy; rather, they seem random, not especially well-posed, and extremely artificial, in the sense that they don't appear to be of particular mathematical interest (or at least the mathematical content is being deliberately hidden for testing purposes) but constructed according to some odd criteria. I'd be happy to hear those mathematicians elaborate on their comments.