科技回声

3 comments

IceHegel, about 1 month ago

I was playing around with having this model plot orbital trajectories and it was seriously impressive. Other top-tier models would struggle to get functional simulations working. Gemini 2.5 was able to do it after three or four turns in Cursor. It does feel like a meaningful step up in mathematical reasoning and math-dense coding.

On the other hand, if you try to play chess with any of these reasoning models (including Gemini 2.5), it basically doesn't work at all. They keep forgetting where pieces are. Even with RL and sequential thinking on max, they consistently move pieces in impossible ways and mutate the board position.

In a recent test with Gemini 2.5, it used like 1700 thinking tokens to conclude it was in checkmate... but it wasn't. It's going to be very hard to trust these models to do new science or to operate outside of domains humans can verify while this kind of behavior continues.

adverbly, about 1 month ago

This does look like a large relative increase in score, but it seems to come from going from zero correct out of 6 to about 1 and 1/2 correct. I think it's fair to say the sample size here is relatively small. Still, a record is a record! Congrats to the team for a new record!

jeffbee, about 1 month ago

Odd that the ETHZ authors published this less than a week ago, excluding Gemini 2.5:

"PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD"

https://files.sri.inf.ethz.ch/matharena/usamo_report.pdf

Gemini 2.5 gets 24.4% on MathArena USAMO, beating previous top score of 4.7%
