科技回声

Vending-Bench: Testing long-term coherence in agents

5 points by tosh 24 days ago

2 comments

Tangokat 24 days ago
"However, not all Sonnet runs achieve this level of understanding of the eval. In the shortest run (~18 simulated days), the model fails to stock items, mistakenly believing its orders have arrived before they actually have, leading to errors when instructing the sub-agent to restock the machine. The model then enters a “doom loop”. It decides to “close” the business (which is not possible in the simulation), and attempts to contact the FBI when the daily fee of $2 continues being charged."

This is pretty funny. I wonder if LLMs can actually become consistent enough to run a business like this or if they will forever be prone to hallucinations and get confused over longer context. If we can get to a point where the agent can contact a human if it runs into unsolvable problems (or just contact another LLM agent?) then it starts being pretty useful.
kranke155 24 days ago
Love this