OpenAI operator and Anthropic performed below average on chimp memory test

1 point by ykhli, 4 months ago

1 comment

ykhli, 4 months ago
I benchmarked OpenAI operator and Anthropic's computer use on the human benchmark chimp test (https://humanbenchmark.com/tests/chimp).

Before I began the test, I thought the agents would be much better at this task than most humans -- after all they should have better, more stateful memory than us. The results are intriguing.

Here are the scores from 10 attempts:

OpenAI operator: 5, 5, 6, 5, 5, 4, 6, 5, 5, 5
Anthropic computer use agent: 7, 9, 6 (rate limited), 12, 9, 7, 9, 11, 12, 6 (rate limited)
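For anyone who wants to compare the two runs at a glance, here is a rough Python sketch that summarizes the scores listed above. The variable names and the `summarize` helper are made up for illustration, and dropping the two rate-limited attempts is just one possible way to handle them, not something the benchmark itself prescribes.

```python
from statistics import mean, median

# Scores exactly as reported in the post (10 attempts each).
operator_scores = [5, 5, 6, 5, 5, 4, 6, 5, 5, 5]

# For the computer use agent, the flag marks attempts the post
# labels "rate limited" (hypothetical representation).
computer_use_runs = [
    (7, False), (9, False), (6, True), (12, False), (9, False),
    (7, False), (9, False), (11, False), (12, False), (6, True),
]


def summarize(scores):
    """Return basic descriptive statistics for a list of scores."""
    return {
        "n": len(scores),
        "mean": round(mean(scores), 2),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }


all_computer_use = [score for score, _ in computer_use_runs]
clean_computer_use = [score for score, limited in computer_use_runs if not limited]

operator_summary = summarize(operator_scores)
computer_use_summary = summarize(all_computer_use)
computer_use_clean_summary = summarize(clean_computer_use)

print("OpenAI operator:", operator_summary)
print("Anthropic computer use (all runs):", computer_use_summary)
print("Anthropic computer use (excl. rate limited):", computer_use_clean_summary)
```

On these numbers the operator runs average 5.1, and the computer use runs average 8.8 over all ten attempts, or 9.5 if the two rate-limited attempts are left out. With only ten attempts per agent, these are rough indications rather than stable estimates.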