TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad

13 points by mfiguiere about 2 months ago

1 comment

cuuupid about 2 months ago
> The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems.

This is exactly the problem that needs to be solved. The yes-man nature of LLMs is the biggest inhibitor to progress, as a model that cannot self-evaluate well cannot learn.

If we solve this, though, combined with reasoning, I feel somewhat confident we will be able to achieve "AGI," at least over text-accessible domains.
[Comment #43547182 not loaded]