The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”; then, if you’re lucky, they’ll add “and I expected Z”. Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.<p>The difficulty in fixing a bug is in figuring out what’s causing it. If you already know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?<p>Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.
Very cool project!<p>I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long, expensive sessions that go down the wrong rabbit hole and ultimately fail.<p>It's great that you succeed on 12% of SWE-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?<p>Also, I think SWE-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?<p>I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.
Very neat. It uses the LangChain method; here are some of the prompts:<p><a href="https://github.com/princeton-nlp/SWE-agent/blob/main/config/default.yaml">https://github.com/princeton-nlp/SWE-agent/blob/main/config/...</a>
For anyone who didn't bother looking deeper: the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.<p>I'm working on a more general SWE task eval framework, written in JS, for arbitrary languages and frameworks (starting with JS/TS, SQL, and Python), for my own prompt engineering product.<p>Hit me up if you are interested.
Their demo is so similar to the Devin one that I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put much weight on demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what third parties have to say after trying these things.
If AI-generated pull requests become a popular thing, we'll see the end of public bug trackers.<p>(not because bugs will be gone, but because the cost of reviewing the PRs vs the benefit gained by the project will be a substantial net loss)
Friendly suggestion to the authors: success rates aren't meaningful to anyone but a handful of researchers. They should add a few examples of tests SWE-agent passed and did not pass to the README.
I'm working on a somewhat similar project: <a href="https://github.com/plandex-ai/plandex">https://github.com/plandex-ai/plandex</a><p>While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.<p>I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.
What veterans in the field know, and AI hasn't tackled, is that the majority of the difficulty in development is dealing with complexity and ambiguity: a lot of it comes down to communication between people in natural language, and to reasoning about your system in natural language. These things are not solved by AI as it is now. If you can fully specify what you want, with all of the detail, corner cases, and situation handling, then at some point AI might be able to build all of that for you. Great! Unfortunately, that specification is the actual hard part; the implementation generally isn't.
I would like something like this that helps me, as a green developer, find open source projects to contribute to.<p>For instance, I recently learned how to replace setup.py with pyproject.toml for a large number of projects. I also learned how to publish packages to PyPI. These changes make a project noticeably easier to install and contribute to, and are very easy to do.<p>The main thing that holds people back is that Python packaging documentation is notoriously cryptic - well, I've already paid that cost, and now it's easy!<p>So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through PyPI.<p>I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.<p>I could maybe upgrade 100 projects, then write up the adventure.<p>Anyone have inspiration/similar ideas, and wanna brainstorm?
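To give a sense of how small that change usually is, here is a minimal sketch of a PEP 621-style pyproject.toml, assuming the setuptools backend (the name, version, and dependency below are just placeholders):

    [build-system]
    # assuming setuptools as the build backend; hatchling or flit work similarly
    requires = ["setuptools>=61"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "example-package"        # placeholder name
    version = "0.1.0"
    description = "Short description of the project"
    requires-python = ">=3.8"
    dependencies = [
        "requests",                 # placeholder dependency
    ]

With that in place, building and uploading is roughly "python -m build" followed by "twine upload dist/*", which is most of what publishing to PyPI amounts to.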
I made a lot of money as I was paid hourly while working with a cadre of people I called "the defect generators".<p>I'm kind of sad that future generations will not have that experience...
And creates how many new ones?<p>This and Devin generate garbage code that will make any codebase worse.<p>It's a joke that 12.5% is even associated with the word "success".
A 1/8 chance of fixing a bug, at the cost of a careful review and some corrections, is not bad.<p>A 0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT as the starting point). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.
I think that "Demo" link is just an extremely annoying version of an HTML presentation, so they could save me a shitload of clicking if they just dumped their presentation out to a PDF or whatever, so I could read faster than watching it type out text as if it were live. It also whines a lot in the console about its inability to connect to a websocket server on port 3000, but I don't know what it would do with a websocket connection if it had it.
If you are afraid that LLMs will replace you at your job, ask an LLM to write Rust code for reading a UTF-8 file character by character.<p>Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e., it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive line the LLM is taking. Mind you, the more popular the issue, the better the line the LLM takes. So in other words, IMHO it's a glorified Stack Overflow. Just as there are engineers who copy-paste from SO without having any idea what the code does, there will be engineers who just copy-paste from an LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a senior SWE and above.
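To make that concrete, here is a minimal sketch of the "close enough" answer (the file name is a placeholder): it reads the whole file into memory and iterates Unicode scalar values, which already glosses over whether "character" means a char, a grapheme cluster, or something read in a streaming fashion where a code point can straddle a buffer boundary.

    use std::fs;
    use std::io;

    fn main() -> io::Result<()> {
        // Reads the whole file into memory; errors out if it isn't valid UTF-8.
        let text = fs::read_to_string("input.txt")?; // placeholder path
        // `char` here is a Unicode scalar value, not a user-perceived
        // character (grapheme cluster), which is often what people mean.
        for (i, ch) in text.chars().enumerate() {
            println!("{i}: {ch:?}");
        }
        Ok(())
    }

A genuinely streaming version, or one that handles grapheme clusters, is exactly the kind of detail where that supervision is needed.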
Once we have this fully automated, any good developer could have a team of 100 robo SWEs and ship like crazy. The real competition is with those devs, not with the bots.