Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

13 points · by mfiguiere · about 2 months ago

1 comment

cuuupid · about 2 months ago
> The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems.

This is exactly the problem that needs to be solved. The yes-man nature of LLMs is the biggest inhibitor to progress, as a model that cannot self-evaluate well cannot learn.

If we solve this, though, combined with reasoning, I feel somewhat confident we will be able to achieve “AGI,” at least over text-accessible domains.
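The overclaiming described above can be framed as a measurable gap: how often the model claims its solution is correct when it is not. Below is a minimal sketch of that measurement, assuming hypothetical `solve_fn` and `self_assess_fn` callables standing in for the model and a ground-truth checker per problem; this is an illustration, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    statement: str
    # Ground-truth checker for an attempted solution (e.g., expert grading).
    is_solution_correct: Callable[[str], bool]


def overclaim_rate(
    problems: List[Problem],
    solve_fn: Callable[[str], str],              # model produces a solution attempt
    self_assess_fn: Callable[[str, str], bool],  # model claims whether its attempt is correct
) -> float:
    """Fraction of problems where the model claims success but the attempt is wrong."""
    overclaims = 0
    for p in problems:
        attempt = solve_fn(p.statement)
        claimed_correct = self_assess_fn(p.statement, attempt)
        actually_correct = p.is_solution_correct(attempt)
        if claimed_correct and not actually_correct:
            overclaims += 1
    return overclaims / len(problems) if problems else 0.0
```

A well-calibrated solver would keep this rate near zero; the comment's point is that current LLMs score poorly on exactly this kind of check.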