The extremely interesting part is that 3.5 Sonnet comes out above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best at real-world tasks rather than one-off scripts or math. And the weirdest part is that they tested the 20240620 snapshot, which is objectively worse at code than the newer 20241022 (the so-called v2).
I hire software engineers off Upwork. Part of our process is a 1-hour take-home screening question that we ask people to solve. We always have a main one and an alternate for each role. I've tested all of ours on each of the main models, and none have been able to solve any of the screening questions yet.
First time commenter - I was so triggered by this benchmark that I just had to come out of lurking.

I've spent time going over the description and the cases, and it's a misrepresented travesty.

The benchmark takes existing cases from Upwork, then *reintroduces the problems* back into the code, and then asks the LLM to fix them, testing against newly written 'comprehensive tests'.

Let's look at some of the cases:

1. The regex zip code validation problem

Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - it was mainly that they were using a common regex to validate across all countries, so the solution had to introduce country-specific regexes etc.

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14958/bug_reintroduce.patch - just takes that new code and adds a comma to two countries' patterns...

2. Room showing empty - 14857

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14857/bug_reintroduce.patch - adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...

I could go on and on and on...

The "extensive tests" are also laughable :(

I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.
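To make that concrete, here is a hypothetical Python sketch - not the actual Expensify or benchmark code, with purely illustrative country patterns and function names - of what those two "reintroduced bugs" amount to: a stray comma dropped into a couple of country-specific postcode regexes, and a helper hard-coded to return an empty array.

    import re

    # Country-specific ZIP/postcode patterns, in the spirit of the fix for issue 14958.
    # Illustrative patterns only, not the ones used in the Expensify app.
    ZIP_REGEX_BY_COUNTRY = {
        "US": r"^\d{5}(?:-\d{4})?$",
        "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$",
        "DE": r"^\d{5}$",
    }

    # The kind of "bug" the benchmark patch reintroduces: a stray comma in two
    # countries' patterns, which any per-country test catches immediately.
    BUGGY_ZIP_REGEX_BY_COUNTRY = dict(
        ZIP_REGEX_BY_COUNTRY,
        US=r"^\d{5},(?:-\d{4})?$",
        DE=r"^\d{5},$",
    )

    def is_valid_zip(country: str, zip_code: str, patterns=ZIP_REGEX_BY_COUNTRY) -> bool:
        """Validate a postcode against the country-specific pattern."""
        pattern = patterns.get(country)
        return bool(pattern and re.match(pattern, zip_code))

    # Issue 14857-style "bug": a helper hard-coded to return nothing,
    # flagged by its own comment as intentional.
    def get_room_members(room_id: str) -> list:
        return []  # intentionally returning an empty array

    if __name__ == "__main__":
        print(is_valid_zip("US", "10001"))                                       # True
        print(is_valid_zip("US", "10001", patterns=BUGGY_ZIP_REGEX_BY_COUNTRY))  # False once the comma is in

Any model that has seen the original repo (or just the patch comments) barely has to reason to undo changes like these.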
It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLM. (It is not clear from my scan whether the actual answers are also in the public corpus.)

Does this work as an experiment if the questions under test were also used to train the LLMs?
> *By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.*

What could be costed into the value of an Upwork or Mechanical Turk task?

*Task centrality* or *blockingness* estimation: precedence edges, tsort topological sort, graph metrics like centrality (a rough sketch of this one follows below)

*Task complexity* estimation: story points, planning poker, relative local complexity scales

*Task value* estimation: cost/benefit analysis, marginal revenue
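Not from the paper - just a minimal, self-contained Python sketch of that first bullet, using a made-up toy precedence graph: a topological order (tsort) plus a crude "blockingness" score that counts how many downstream tasks each task blocks.

    from collections import defaultdict

    # task -> tasks that depend on it (precedence edges); toy data for illustration
    edges = {"design": ["api", "ui"], "api": ["tests"], "ui": ["tests"], "tests": []}

    def topo_order(graph):
        """Kahn's algorithm: repeatedly emit tasks whose prerequisites are all done."""
        indegree = defaultdict(int)
        for deps in graph.values():
            for d in deps:
                indegree[d] += 1
        ready = [t for t in graph if indegree[t] == 0]
        order = []
        while ready:
            task = ready.pop()
            order.append(task)
            for d in graph[task]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
        return order

    def blockingness(graph, task):
        """Number of distinct downstream tasks this task blocks (reachability count)."""
        seen, stack = set(), list(graph[task])
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(graph[t])
        return len(seen)

    if __name__ == "__main__":
        print(topo_order(edges))                           # ['design', 'ui', 'api', 'tests']
        print({t: blockingness(edges, t) for t in edges})  # {'design': 3, 'api': 1, 'ui': 1, 'tests': 0}

A task's price on a marketplace mostly reflects complexity and urgency; something like blockingness would have to come from the surrounding project graph, which these benchmarks don't capture.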
And how do you evaluate whether the task was completed correctly? There are nearly infinite ways to solve a given software dev problem if it isn't trivial (and I hope they are not benchmarking trivial problems).
The writing is very clearly on the wall.

On a less pessimistic note, I don't think the SWE role will disappear, but what's the best thing one could do to prepare for this?
Can anyone explain how this research benefits humanity, per OpenAI's mission?

OpenAI's AGI mission statement:

> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."

https://openai.com/index/how-should-ai-systems-behave/

I would have to admit some humility here, as I sort of brought this on myself [1]:

> This is a fantastic idea.
Perhaps this should then be the next test for these SWE agents, in the same manner as the "Will Smith Eats Spaghetti" video tests:

https://news.ycombinator.com/item?id=43032191

But curiously, the question is still valid.

Related:

Sam Altman: "50¢ of compute of a SWE agent can yield $500 or $5k of work."

https://news.ycombinator.com/item?id=43032098

https://x.com/vitrupo/status/1889720371072696554