The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”; then, if you’re lucky, they’ll add “and I expected Z”. Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.<p>The difficulty in fixing a bug is in figuring out what’s causing it. If you already know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?<p>Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.
Very cool project!<p>I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long, expensive sessions that go down the wrong rabbit hole and ultimately fail.<p>It's great that you succeed on 12% of SWE-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?<p>Also, I think SWE-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?<p>I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.
Very neat. It uses the LangChain method; here are some of the prompts:<p><a href="https://github.com/princeton-nlp/SWE-agent/blob/main/config/default.yaml">https://github.com/princeton-nlp/SWE-agent/blob/main/config/...</a>
For anyone who didn't bother looking deeper: the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.<p>I'm working on a more general SWE task eval framework, written in JS, for arbitrary languages and frameworks (starting with JS/TS, SQL, and Python), for my own prompt engineering product.<p>Hit me up if you are interested.
Their demo is so similar to the Devin one that I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put much weight on demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what third parties have to say after trying these things.
If AI-generated pull requests become a popular thing, we'll see the end of public bug trackers.<p>(not because bugs will be gone, but because the cost of reviewing the PRs vs the benefit gained by the project will be a substantial net loss)
Friendly suggestion to the authors: success rates aren't meaningful to anyone but a handful of researchers. They should add a few examples of tests SWE-agent passed and did not pass to the README.
I'm working on a somewhat similar project: <a href="https://github.com/plandex-ai/plandex">https://github.com/plandex-ai/plandex</a><p>While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.<p>I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.
What veterans in the field know, and AI hasn't tackled, is that the majority of the difficulty in development is dealing with complexity and ambiguity: a lot of it comes down to communication between people in natural language, and to reasoning about your system in natural language. These things are not solved by AI as it is now. If you can fully specify what you want, with all of the detail, corner cases, and situation handling, then at some point AI might be able to build all of that for you. Great! Unfortunately, that specification is the actual hard part; the implementation generally isn't.
I would like something like this that helps me, as a green developer, find open source projects to contribute to.<p>For instance, I recently learned how to replace setup.py with pyproject.toml for a large number of projects. I also learned how to publish packages to PyPI. These changes make a project noticeably easier to install and contribute to, and are very easy to do.<p>The main thing that holds people back is that Python packaging documentation is notoriously cryptic - well, I've already paid that cost, and now it's easy!<p>So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through PyPI.<p>I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.<p>I could maybe upgrade 100 projects, then write up the adventure.<p>Anyone have inspiration/similar ideas, and wanna brainstorm?
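To give a sense of how small that change usually is, here is a minimal sketch of a PEP 621-style pyproject.toml, assuming the setuptools backend (the name, version, and dependency below are just placeholders):

    [build-system]
    # assuming setuptools as the build backend; hatchling or flit work similarly
    requires = ["setuptools>=61"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "example-package"        # placeholder name
    version = "0.1.0"
    description = "Short description of the project"
    requires-python = ">=3.8"
    dependencies = [
        "requests",                 # placeholder dependency
    ]

With that in place, building and uploading is roughly "python -m build" followed by "twine upload dist/*", which is most of what publishing to PyPI amounts to.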
I made a lot of money as I was paid hourly while working with a cadre of people I called "the defect generators".<p>I'm kind of sad that future generations will not have that experience...
And creates how many new ones?<p>This and Devin generate garbage code that will make any codebase worse.<p>It's a joke that 12.5% is even associated with the word "success".
A 1/8 chance of fixing a bug, at the cost of a careful review and some corrections, is not bad.<p>A 0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT as the starting point). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.
I think that "Demo" link is just an extremely annoying version of an HTML presentation, so they could save me a shitload of clicking if they just dumped their presentation out to a PDF or whatever, so I could read faster than watching it type out text as if it were live. It also whines a lot in the console about its inability to connect to a websocket server on port 3000, but I don't know what it would do with a websocket connection if it had it.
If you are afraid that LLMs will replace you at your job, ask an LLM to write Rust code for reading a UTF-8 file character by character.<p>Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e., it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive line the LLM is taking. Mind you, the more popular the issue, the better the line the LLM takes. So in other words, IMHO it's a glorified Stack Overflow. Just as there are engineers who copy-paste from SO without having any idea what the code does, there will be engineers who just copy-paste from an LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a senior SWE and above.
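To make that concrete, here is a minimal sketch of the "close enough" answer (the file name is a placeholder): it reads the whole file into memory and iterates Unicode scalar values, which already glosses over whether "character" means a char, a grapheme cluster, or something read in a streaming fashion where a code point can straddle a buffer boundary.

    use std::fs;
    use std::io;

    fn main() -> io::Result<()> {
        // Reads the whole file into memory; errors out if it isn't valid UTF-8.
        let text = fs::read_to_string("input.txt")?; // placeholder path
        // `char` here is a Unicode scalar value, not a user-perceived
        // character (grapheme cluster), which is often what people mean.
        for (i, ch) in text.chars().enumerate() {
            println!("{i}: {ch:?}");
        }
        Ok(())
    }

A genuinely streaming version, or one that handles grapheme clusters, is exactly the kind of detail where that supervision is needed.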
Once we have this fully automated, any good developer could have a team of 100 robo SWEs and ship like crazy. The real competition is with those devs, not with the bots.