Some critical issues with the SWE-bench dataset

350 points by joshwa 3 months ago

19 comments

comex 3 months ago
Some of the examples in the paper seem to be wrong.

For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop." But if you look at the diff, that's clearly wrong. The try-except block and running check were *already there* before the patch. The human patch just indented them, making them appear as both - and +, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly *more* efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.

For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. `reversed` treats the object as an iterator, and iterating over a dictionary in Python just gives you the keys, so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.

Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.
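For the django-32517 point, a standalone check confirms the equivalence. This is a minimal sketch, not the actual Django code, assuming Python 3.8+ (where dicts and their key views are both reversible):

    # Iterating a dict yields its keys, so reversing the dict and reversing
    # its .keys() view produce the same sequence.
    d = {"a": 1, "b": 2, "c": 3}
    assert list(reversed(d)) == ["c", "b", "a"]
    assert list(reversed(d.keys())) == ["c", "b", "a"]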
modeless 3 months ago
> When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%.

This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.
bearjaws 3 months ago
I would argue almost every popular benchmark quoted by the big LLM companies is tainted.

OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try and write code and it's just *okay*.

They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.
ukFxqnLa2sBSBf6 3 months ago
There are a few things I'm not understanding here.

1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?

2. Are the issues locked after they're included in the dataset? You'd think they would be immutable for reproducibility.

3. For the agents writing patches, is test running part of their inner-loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how, unless the tests aren't part of the repo.
dang 3 months ago
Submitted title was "SWE-Bench tainted by answer leakage; real pass rates significantly lower". Normally we'd replace that with the article title, in keeping with the site guideline ("Please use the original title, unless it is misleading or linkbait; don't editorialize."), but in this case the article title is so generic that this is arguably misleading as well, so I took a representative phrase from the abstract instead. That's preferable, because it's better to use the authors' own representation of their article.

If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.

https://news.ycombinator.com/newsguidelines.html
semi-extrinsic 3 months ago
So what we need is something like a versioned, crowdsourced coding-LLM eval dataset.

Every quarter, you have a couple thousand volunteers provide 2 GitHub issues from the past 3 months, which are nontrivial to resolve, and where there exist strong test cases. Each volunteer then cross-checks 2 issues from other volunteers. The volunteers get a 1-month free subscription to some AI service in return.

This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
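A rough sketch of how such an entry and the "only score on post-cutoff releases" rule might look. All names here are hypothetical, invented for illustration, not from the comment or any existing tool:

    from dataclasses import dataclass, field
    from datetime import date

    # Hypothetical record for one crowdsourced benchmark entry.
    @dataclass
    class BenchEntry:
        issue_url: str              # GitHub issue from the past 3 months
        has_strong_tests: bool      # submitter attests strong test cases exist
        submitted_by: str           # volunteer who proposed the issue
        cross_checked_by: list = field(default_factory=list)  # independent reviewers

    def eligible(dataset_release: date, model_training_cutoff: date) -> bool:
        # A model may only be scored on dataset versions published after its
        # training-data cutoff, e.g. SWE-UberBench-2025-02 for a late-2024 cutoff.
        return dataset_release > model_training_cutoff

    assert eligible(date(2025, 2, 1), model_training_cutoff=date(2024, 12, 31))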
optimalsolver 3 months ago
You need benchmarks with the following three properties:

1) No known solutions, so there's no "ground truth" dataset to train on.

2) Presumably hard to solve.

3) But easy to verify a solution if one is provided.

This, of course, is easier done on the STEM side of things, but how do you automatically test creativity, or philosophical aptitude?
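One way to read properties 1-3 in code: the benchmark ships only a cheap verifier and never a stored answer, so there is nothing to leak into training data. A toy sketch (hypothetical, using a SAT-style task where checking an assignment is trivial but finding one is generally hard):

    # Each clause is a tuple of literals: positive = variable, negative = its negation.
    clauses = [(1, -2, 3), (-1, 2), (2, -3)]

    def verify(assignment: dict) -> bool:
        # Cheap check: every clause must contain at least one satisfied literal.
        return all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in clauses
        )

    # The benchmark stores only `clauses` and `verify`; a submission is graded by
    # running the verifier, not by comparing against a reference answer.
    print(verify({1: True, 2: True, 3: True}))  # True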
huac 3 months ago
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Looking at the benchmark, https://www.swebench.com/, about half of the scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
perrygeo 3 months ago
The solution moving forward has to be private benchmark suites. I could see teams investing in their own set of programming challenges and periodically re-evaluating them - similar to how we would construct sets of live interview questions for candidates and *qualitatively* assess their ability.

It's so vital that it's not leaked and that it's fit-for-purpose and manually assessed. These general-purpose, public benchmarks based on questionable metrics are effectively worthless for assessing real programming skill.

Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but vastly better than the alternatives in practice. I don't trust Claude fully, but far more than OpenAI models; it's not even close. The IRL performance advantage is not reflected in any of these benchmarks.
brap 3 months ago
My own impression with SoTA models is that they're very useful for coding, yet they suck at solving unique problems (and every sufficiently large codebase has plenty of those).
MattDaEskimo 3 months ago
There's a serious issue with benchmarks.

Instead of resolving it, some leaders are further complicating what benchmarks mean, such as OpenAI grading their benchmarks on "how much money they made" or "how easily a model was convinced to hand over fake money".
otterley 3 months ago
I am shocked—*shocked*—when a vendor cheats in order to increase their benchmark scores.

I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.
1024core 3 months ago
To quote Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.

Or, as in the case of LLMs and benchmarks: when a benchmark becomes a target, it ceases to be a good benchmark.
OldGreenYodaGPT 3 months ago
> solutions were directly provided in the issue report or the comments

This is fine; many of my real tickets already explain the solution. A good ticket often offers a solution or says where to start looking.
ionwake 3 months ago
I was wondering how long this would take to surface. You can tell a surprising amount just by carefully watching how the trainers answer interview questions, which is kinda meta, really.
shayanh 3 months ago
I found that this paper was submitted to ICLR but got rejected: https://openreview.net/forum?id=pwIGnH2LHJ

To me, the analysis of SWE-Bench is a solid contribution and informative. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.
acc_297 3 months ago
> 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments.

Is this what Hofstadter means by a strange loop?
alalv 3 months ago
Something weird (or at least uncommon) that has caught my attention, and that I haven't seen mentioned in the comments, is that they cite the SWE-bench paper's first author by first name in the abstract ("Carlos et al.") and then by last name (as is usually done) in the paper itself ("Jimenez et al.").
htrp 3 months ago
Paper from October 2024