So how well is Claude playing Pokémon?

8 points by pxndx, 2 months ago

2 comments

minimaxir, 2 months ago
A notable error in the article:

> The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark.

It did not make it to Vermilion City: it got Misty's badge (with some fun battle RNG), then got stuck in Cerulean City and *could not get out*. The next objective was to go north to Bill's House to get the S.S. Anne ticket, which is required before going to Vermilion City, but it just couldn't do that.

Given the number of loops in this livestream, I am somewhat skeptical of that benchmark results chart. There's no way it somehow made it to Vermilion, beat the S.S. Anne for HM Cut, and also beat Surge with the relative number of actions implied by the chart.

rvz, 2 months ago
> TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.

This is what happens when you try to apply hype technology (LLMs) to every problem, especially with a company that has amassed too much hype too quickly.

The limits of said technology tell us that Claude has a very limited memory to plan with in the game, which is why it is obviously struggling. But expanding those limits would cost Anthropic an enormous amount of money and compute.

So you can clearly see that if LLMs are unable to beat this game efficiently as a test of planning and reasoning, what hope is there for them in the much more challenging and complex scenarios required for so-called "AGI"?

The most important sentence in this article is this:

>> ...some new paradigm is yet required for them to be right.