I launched a comparable tool recently [1]. I've deliberately *not* been calling it an "AI Software Engineer," as I don't think that framing matches the capabilities of current models.

My focus has been on giving the developer as much fine-grained control over the LLM-based agent as possible, in order to tighten the feedback loop and work around bad output (which, unfortunately, is inevitable).

In self-driving parlance, I think of it as L3: the agent can work autonomously, but the best results come from the developer keeping their hands on the wheel and making corrections when needed. Imho that is currently the sweet spot for real productivity.

[1] https://github.com/plandex-ai/plandex
AI software engineers like Devin and SWE-agent are frequently compared to human software engineers. However, SWE-bench, the benchmark on which this comparison rests, covers only Python tasks, most of which involve single-file changes of 15 lines or fewer, and it relies solely on unit tests to judge correctness. My aim is to give you a framework for assessing whether AI's progress on this benchmark is relevant to your organization's work.
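To make that evaluation criterion concrete, here is a minimal sketch of a SWE-bench-style unit-test check. The function and parameter names are illustrative, not the benchmark's actual harness; it assumes a clean local checkout with git and pytest available, and that the benchmark supplies both the tests that should flip from failing to passing and the tests that must keep passing.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Sketch of a SWE-bench-style check: apply the model's patch, then
    require the previously failing tests to pass and the previously
    passing tests to keep passing. Illustrative only, not the real harness."""
    # Apply the model-generated diff to a clean checkout of the repo.
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

    def tests_pass(test_ids: list[str]) -> bool:
        # Run the named unit tests; exit code 0 means they all passed.
        result = subprocess.run(["pytest", *test_ids], cwd=repo_dir)
        return result.returncode == 0

    return tests_pass(fail_to_pass) and tests_pass(pass_to_pass)
```

Note what this criterion does and doesn't capture: a patch counts as correct so long as the named tests pass, regardless of code quality, maintainability, or behavior the tests never exercise.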