TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork

111 points by zone411 3 months ago

11 comments

Tiberium 3 months ago
The extremely interesting part is that 3.5 Sonnet is above o1 on this benchmark, which again shows that 3.5 Sonnet is a very special model that's best for real world tasks and not some one-shot scripts or math. And the weirdest part is that they tested the 20240620 snapshot which is objectively worse on code than the newer 20241022 (so-called v2).
CSMastermind 3 months ago
I hire software engineers off Upwork. Part of our process is a 1-hour take-home screening question that we ask people to solve. We always have a main one and an alternate for each role. I've tested all of ours on each of the main models, and none have been able to solve any of the screening questions yet.
Snuggly73 3 months ago
First-time commenter - I was so triggered by this benchmark that I just had to come out of lurking.

I've spent time going over the description and the cases, and it's a misrepresented travesty.

The benchmark takes existing cases from Upwork, then *reintroduces the problems* back into the code and asks the LLM to fix them, testing against newly written 'comprehensive tests'.

Let's look at some of the cases:

1. The regex zip code validation problem

Looking at the Upwork problem - https://github.com/Expensify/App/issues/14958 - it was mainly that they were using a common regex to validate across all countries, so the solution had to introduce country-specific regexes, etc.

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14958/bug_reintroduce.patch - is just taking that new code and adding "," to two countries...

2. Room showing empty - 14857

The "reintroduced bug" - https://github.com/openai/SWELancer-Benchmark/blob/main/issues/14857/bug_reintroduce.patch

It adds code explicitly commented as introducing a "radical bug" and "intentionally returning an empty array"...

I could go on and on and on...

The "extensive tests" are also laughable :(

I am not sure if OpenAI is actually aware of how great this "benchmark" is, but after so much fanfare, they should be.
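
To make the first case above concrete, here is a minimal sketch of country-specific postal code validation, the kind of fix the original Upwork issue is described as needing. Everything in it is an assumption for illustration: the simplified patterns, the `is_valid_zip` helper, and the `BUGGY_PATTERNS` table are hypothetical and are not Expensify's code or the benchmark's patch. The buggy table only imitates the general idea of loosening one pattern so that a stray comma is accepted.

```python
import re

# Hypothetical, simplified per-country patterns (not Expensify's real table).
ZIP_PATTERNS = {
    "US": r"\d{5}(-\d{4})?",
    "CA": r"[A-Z]\d[A-Z] ?\d[A-Z]\d",
    "DE": r"\d{5}",
}

# Imitation of a "reintroduced bug": the US pattern is loosened to accept commas.
BUGGY_PATTERNS = dict(ZIP_PATTERNS, US=r"[\d,]{5}(-\d{4})?")


def is_valid_zip(country: str, code: str, patterns=ZIP_PATTERNS) -> bool:
    """Return True if `code` fully matches the pattern registered for `country`."""
    pattern = patterns.get(country)
    if pattern is None:
        return False  # unknown country: reject rather than guess
    return re.fullmatch(pattern, code.strip().upper()) is not None
```

With the correct table, `is_valid_zip("US", "9410,")` is rejected; passing `BUGGY_PATTERNS` makes the same input pass, which is the sort of regression a test suite for this task would need to catch.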
runako 3 months ago
It looks like they sourced tasks via a public GitHub repository, which is possibly part of the training dataset for the LLM. (It is not clear from my scan whether the actual answers are also in the public corpus.)

Does this work as an experiment if the questions under test were also used to train the LLMs?
westurner 3 months ago
> *By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.*

What could be costed in an Upwork or a Mechanical Turk task value?

*Task Centrality* or *Blockingness* estimation: precedence edges, tsort topological sort, graph metrics like centrality

*Task Complexity* estimation: story points, planning poker, relative local complexity scales

*Task Value* estimation: cost/benefit analysis, marginal revenue
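
As a rough illustration of the "blockingness" idea above, the sketch below orders a small dependency graph topologically and counts how many tasks transitively wait on each one. The task names, the `DEPENDS_ON` graph, and the `blocked_by` helper are hypothetical; counting transitive dependents is just one crude stand-in for a centrality metric, not something the benchmark itself does.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key depends on the tasks in its set.
DEPENDS_ON = {
    "deploy": {"fix-login", "add-tests"},
    "add-tests": {"fix-login"},
    "fix-login": set(),
    "update-docs": set(),
}


def blocked_by(task: str, depends_on: dict) -> set:
    """Return every task that transitively depends on `task`."""
    blocked = set()
    frontier = [task]
    while frontier:
        current = frontier.pop()
        for other, deps in depends_on.items():
            if current in deps and other not in blocked:
                blocked.add(other)
                frontier.append(other)
    return blocked


# A valid execution order: prerequisites always come before their dependents.
order = list(TopologicalSorter(DEPENDS_ON).static_order())

# "Blockingness" as a crude score: how many tasks wait on each one.
blockingness = {task: len(blocked_by(task, DEPENDS_ON)) for task in DEPENDS_ON}
```

In this toy graph, `fix-login` blocks two other tasks and so scores highest, which is the kind of signal that could feed into a task's estimated value alongside complexity and cost/benefit.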
bufferoverflow 3 months ago
And how do you evaluate if the task was completed correctly? There are nearly infinite ways to solve a given software dev problem, if the problem isn't trivial (and I hope they are not benchmarking trivial problems).
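
One common answer to this question, and the one the thread's mentions of the benchmark's tests point toward, is to grade observable behavior rather than the code itself. The sketch below is a toy version under that assumption: the whitespace-collapsing task, both solutions, `CASES`, and the `passes` helper are all hypothetical and are not taken from SWE-Lancer.

```python
from typing import Callable, Iterable, Tuple


# Hypothetical task: collapse runs of whitespace into single spaces.
def solution_a(text: str) -> str:
    return " ".join(text.split())


def solution_b(text: str) -> str:
    out = []
    for ch in text.strip():
        if ch.isspace():
            # Emit at most one space per run of whitespace.
            if out and out[-1] != " ":
                out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


CASES = [("a  b", "a b"), ("  lead", "lead"), ("tab\tsep\n", "tab sep"), ("", "")]


def passes(candidate: Callable[[str], str], cases: Iterable[Tuple[str, str]]) -> bool:
    """Grade by observable behavior only; the implementation is never inspected."""
    return all(candidate(given) == expected for given, expected in cases)
```

Both solutions pass even though they share no code, so the grader does not care which of the many possible implementations was chosen. The obvious weakness is that the verdict is only as good as the test cases.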
moralestapia 3 months ago
The writing is very clearly on the wall.

On a non-pessimist note, I don't think the SWE role will disappear, but what's the best one could do to be prepared for this?
comeonbro 3 months ago
Models tested: o1, 4o (August 2024 version), 3.5 Sonnet (June 2024 version)

Notably missing: o3

Consult this graph and extrapolate: https://i.imgur.com/EOKhZpL.png
neilv 3 months ago
"SWE-Lancer", like, skewering SWEs with a lance?
ctoth 3 months ago
Gonna lance them SWEs like a boil!
colesantiago 3 months ago
Can anyone explain how this research benefits humanity for OpenAI's mission?

OpenAI's AGI mission statement:

> "By AGI we mean highly autonomous systems that outperform humans at most economically valuable work."

https://openai.com/index/how-should-ai-systems-behave/

I would have to admit some humility as I sort of brought this on myself [1]

> This is a fantastic idea. Perhaps then this should be the next test for these SWE Agents, in the same manner as the "Will Smith Eats Spaghetti" video tests

https://news.ycombinator.com/item?id=43032191

But curiously the question is still valid.

Related:

Sam Altman: 50¢ of compute of a SWE Agent can yield "$500 or $5k of work."

https://news.ycombinator.com/item?id=43032098

https://x.com/vitrupo/status/1889720371072696554