Gemini 2.5 gets 24.4% on MathArena USAMO beating previous top score of 4.7%

54 points by alphabetting, about 1 month ago

3 comments

IceHegel, about 1 month ago
I was playing around with having this model plot orbital trajectories and it was seriously impressive. Other top-tier models would struggle to get functional simulations working. Gemini 2.5 was able to do it after three or four turns in Cursor. It does feel like a meaningful step up in mathematical reasoning and math-dense coding.

On the other hand, if you try to play chess with any of these reasoning models (including Gemini 2.5), it basically doesn't work at all. They keep forgetting where pieces are. Even with RL and sequential thinking on max, they consistently move pieces in impossible ways and mutate the board position.

In a recent test with Gemini 2.5, it used like 1700 thinking tokens to conclude it was in checkmate... but it wasn't. It's going to be very hard to trust these models to do new science or to operate outside of domains humans can verify while this kind of behavior continues.
adverbly, about 1 month ago
This does look like a large relative increase in score, but it seems like it comes from getting zero correct out of 6 to getting 1 and 1/2 correct. I think it's fair to say the sample size here is relatively small. Still, a record is a record! Congrats to the team for a new record!
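As a rough check on that arithmetic, the sketch below converts the two headline percentages into points and problem-equivalents. It assumes the usual USAMO format of 6 problems graded out of 7 points each (42 total) and that the reported percentage is the share of those 42 points; the helper name `score_breakdown` is hypothetical and not part of MathArena.

```python
# Assumption: 6 problems, 7 points each, and the reported percentage
# is the share of the 42 available points.
PROBLEMS = 6
POINTS_PER_PROBLEM = 7
TOTAL_POINTS = PROBLEMS * POINTS_PER_PROBLEM


def score_breakdown(percent: float) -> tuple[float, float]:
    """Return (points out of 42, equivalent number of fully solved problems)."""
    points = percent / 100 * TOTAL_POINTS
    return points, points / POINTS_PER_PROBLEM


# Percentages taken from the thread title.
for label, pct in [("Gemini 2.5", 24.4), ("previous top score", 4.7)]:
    points, problems = score_breakdown(pct)
    print(f"{label}: {points:.2f}/42 points, about {problems:.2f} problems")
```

Under those assumptions, 24.4% works out to roughly 10.25 points, or about one and a half problems, which matches the estimate in the comment.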
jeffbee, about 1 month ago
Odd that ETHZ authors published less than a week ago excluding Gemini 2.5

"PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD"

https://files.sri.inf.ethz.ch/matharena/usamo_report.pdf