Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

13 points · by mfiguiere · about 2 months ago

1 comment

cuuupid · about 2 months ago
> The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems.

This is exactly the problem that needs to be solved. The yes-man nature of LLMs is the biggest inhibitor to progress, as a model that cannot self-evaluate well cannot learn.

If we solve this, though, combined with reasoning, I feel somewhat confident we will be able to achieve “AGI,” at least over text-accessible domains.
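The overclaiming described above can be framed as a measurable gap: how often the model claims its solution is correct when it is not. Below is a minimal sketch of that measurement, assuming hypothetical `solve_fn` and `self_assess_fn` callables standing in for the model and a ground-truth checker per problem; this is an illustration, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    statement: str
    # Ground-truth checker for an attempted solution (e.g., expert grading).
    is_solution_correct: Callable[[str], bool]


def overclaim_rate(
    problems: List[Problem],
    solve_fn: Callable[[str], str],              # model produces a solution attempt
    self_assess_fn: Callable[[str, str], bool],  # model claims whether its attempt is correct
) -> float:
    """Fraction of problems where the model claims success but the attempt is wrong."""
    overclaims = 0
    for p in problems:
        attempt = solve_fn(p.statement)
        claimed_correct = self_assess_fn(p.statement, attempt)
        actually_correct = p.is_solution_correct(attempt)
        if claimed_correct and not actually_correct:
            overclaims += 1
    return overclaims / len(problems) if problems else 0.0
```

A well-calibrated solver would keep this rate near zero; the comment's point is that current LLMs score poorly on exactly this kind of check.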