科技回声 (Tech Echo)

A tech news platform built with Next.js, offering global tech news and discussion content.


© 2025 科技回声. All rights reserved.

How Do AI Software Engineers Like Devin Compare to Humans?

7 points | by htormey | about 1 year ago

2 comments

danenania, about 1 year ago
I launched a comparable tool recently [1]. I've actually specifically *not* been calling it an "AI Software Engineer", as I don't think that's the right framing for the capabilities of current models.

My focus has been on giving the developer as much fine-grained control of the LLM-based agent as possible, in order to tighten the feedback loop and work around bad output (which is inevitable, unfortunately).

In self-driving parlance, I think of it as L3. The agent can work autonomously, but the best results are achieved by the developer keeping their hands on the wheel and making corrections when needed. Imho that is currently the sweet spot for real productivity.

1 - https://github.com/plandex-ai/plandex
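The "L3" workflow the commenter describes can be sketched in a few lines. This is a hypothetical illustration (the function and its signature are not from Plandex or any real tool): the agent proposes a batch of edits, and the developer approves or rejects each one before anything is applied.

```python
from typing import Callable

def apply_with_review(proposed_edits: list[str],
                      approve: Callable[[str], bool]) -> list[str]:
    """Human-in-the-loop sketch: the agent proposes edits, but only the
    ones the developer explicitly approves are applied ("hands on the
    wheel"). Rejected edits are simply dropped from the batch."""
    applied = []
    for edit in proposed_edits:
        if approve(edit):  # developer inspects and accepts/rejects each edit
            applied.append(edit)
    return applied
```

In practice `approve` would show a diff and prompt the developer; the point is that nothing reaches the working tree without a human decision.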
htormey, about 1 year ago
AI software engineers like Devin and SWE-agent are frequently compared to human software engineers. However, SWE-bench, the benchmark on which this comparison rests, covers only Python tasks, most of which involve single-file changes of 15 lines or fewer, and it relies solely on unit tests to evaluate correctness. My aim is to give you a framework for assessing whether AI's progress on this benchmark is relevant to your organization's work.
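The scope limits described above (Python only, single file, 15 lines or fewer) can be turned into a quick filter. This is a hypothetical heuristic, not part of SWE-bench itself: given a few properties of one of your own tickets, it checks whether the ticket even resembles a typical SWE-bench instance, and so whether benchmark scores say anything about it.

```python
def resembles_swe_bench_task(language: str,
                             files_changed: int,
                             lines_changed: int) -> bool:
    """Hypothetical filter: does a real-world task look like a typical
    SWE-bench instance, i.e. a Python change touching a single file
    with 15 lines or fewer?"""
    return (
        language.lower() == "python"
        and files_changed == 1
        and lines_changed <= 15
    )
```

A cross-cutting refactor in Go, for example, fails every clause, which is exactly the author's point: benchmark progress may not transfer to such work.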