科技回声

11 comments

Tiberium, 3 months ago

The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best for real-world tasks and not just one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot, which is objectively worse on code than the newer 20241022 (so-called v2).

CSMastermind, 3 months ago

I hire software engineers off Upwork. Part of our process is a one-hour take-home screening question that we ask people to solve. We always have a main one and an alternate for each role. I've tested all of ours on each of the main models, and none has been able to solve any of the screening questions yet.

Snuggly73, 3 months ago

First time commenter - I was so triggered by this benchmark that I just had to come out of lurking.

I've spent time going over the description and the cases, and it's a misrepresented travesty.

The benchmark takes existing cases from Upwork, then reintroduces the problems back into the code and asks the LLM to fix them, testing against newly written 'comprehensive tests'.

Let's look at some of the cases:

1. The regex zip code validation problem

Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - it was mainly that they were using a common regex to validate across all countries, so the solution had to introduce country-specific regexes etc.

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14958/bug_reintroduce.patch - is just taking that new code and adding a "," to two countries....

2. Room showing empty - 14857

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14857/bug_reintroduce.patch

Adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...

I could go on and on and on...

The "extensive tests" are also laughable :(

I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare - they should be.

runako, 3 months ago

It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLM. (It is not clear based on my scan whether the actual answers are also possibly in the public corpus.)

Does this work as an experiment if the questions under test were also used to train the LLMs?

westurner, 3 months ago

> By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.

What could be costed in an Upwork or a Mechanical Turk task value?

Task Centrality or Blockingness estimation: precedence edges, tsort topological sort, graph metrics like centrality

Task Complexity estimation: story points, planning poker, relative local complexity scales

Task Value estimation: cost/benefit analysis, marginal revenue

bufferoverflow, 3 months ago

And how do you evaluate if the task was completed correctly? There are nearly infinite ways to solve a given software dev problem, if the problem isn't trivial (and I hope they are not benchmarking trivial problems).

moralestapia, 3 months ago

The writing is very clearly on the wall.

On a non-pessimist note, I don't think the SWE role will disappear, but what's the best one could do to be prepared for this?

comeonbro, 3 months ago

Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)

Notably missing: o3

Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png

neilv, 3 months ago

"SWE-Lancer", like, skewering SWEs with a lance?

ctoth, 3 months ago

Gonna lance them SWEs like a boil!

colesantiago, 3 months ago

Can anyone explain how this research benefits humanity for OpenAI's mission?

OpenAI's AGI mission statement

> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."

https://openai.com/index/how-should-ai-systems-behave/

I would have to admit some humility as I sort of brought this on myself [1]

> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the "Will Smith Eats Spaghetti" video tests

[1] https://news.ycombinator.com/item?id=43032191

But curiously the question is still valid.

Related:

Sam Altman: 50¢ of compute of a SWE Agent can yield "$500 or $5k of work."

https://news.ycombinator.com/item?id=43032098

https://x.com/vitrupo/status/1889720371072696554

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork
