Betteridge's Law of Headlines strikes again. (Well, Hacker News' abbreviated headlines, in this case.)

"Professors Staffed a Fake Company with AI Agents. Guess What Happened?"

"No."

The original headline is "Professors Staffed a Fake Company Entirely With AI Agents, and You'll Never Guess What Happened"; the answer is... uh... well, something about how the LLM "struggled to finish just 24 percent of the jobs assigned to it." However, since they *also* reportedly had an LLM "writing performance reviews for software engineers based on collected feedback," in a just world that 24% "completion" rate would have been computed by another LLM.

Clicking through, it looks like the actual "researchers" are here:

https://the-agent-company.com/

And their project is here:

https://github.com/TheAgentCompany/TheAgentCompany/blob/main/docs/EVALUATION.md

Which (at first glance) looks like a plain old task-based benchmark, i.e. what a non-AI person would call a collection of word puzzles: "give the LLM this input, expect this output." These word puzzles are themed around office jobs. Here's an example input:

https://github.com/TheAgentCompany/TheAgentCompany/blob/main/workspaces/tasks/admin-get-best-vendor-quote/task.md