TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Can LLMs find bugs in large Python codebases?

5 points by sumanyusharma, about 1 year ago

2 comments

sumanyusharma, about 1 year ago
Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases, similar to a text-based needle-in-the-haystack test.

GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than with text-based tasks at long context lengths.

GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models. Gemini-1.0-Pro performed the worst, surprisingly worse than Llama3-70B.

Generally, longer context lengths resulted in lower accuracy, though there were some exceptions. Models also react differently to the placement of the bug within the source code: GPT-3.5-Turbo and Claude 3 Opus were the most sensitive, and the GPT-4 series was the least sensitive. Generally, less sensitivity means a more robust model.

This benchmark has lots of limitations. I would love your feedback and suggestions on how we can make this benchmark more useful!

Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack
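To make the setup concrete, here is a minimal sketch of how a needle-in-the-haystack sample like this could be constructed: filler Python functions are concatenated to build up context length, and one syntactic bug is injected at a chosen depth. The function names, the specific bug type (a dropped closing parenthesis), and the `make_sample` helper are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of a BICS-style sample builder (not the real repo code).
FILLER = (
    "def task_{i}(x):\n"
    '    """Placeholder filler function for the haystack."""\n'
    "    return x + {i}\n"
)

def make_sample(num_functions: int, bug_depth: float):
    """Return (code, bug_line): filler code with one syntactic bug injected.

    bug_depth in [0, 1] controls how deep in the context the bug is placed.
    """
    blocks = [FILLER.format(i=i) for i in range(num_functions)]
    target = int(bug_depth * (num_functions - 1))
    # Inject the "needle": drop the closing parenthesis of one def line,
    # turning it into a syntax error the model must locate.
    lines = blocks[target].splitlines()
    lines[0] = lines[0].replace("):", ":")
    blocks[target] = "\n".join(lines) + "\n"
    # 1-based line number of the buggy def, accounting for the blank
    # separator line that the join below adds between blocks.
    bug_line = sum(b.count("\n") + 1 for b in blocks[:target]) + 1
    return "\n".join(blocks), bug_line
```

The model is then prompted with the generated code and asked for the line number (or function) containing the bug; varying `num_functions` sweeps context length, and varying `bug_depth` sweeps target depth.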
Comment #40418991 not loaded.
coder4life, about 1 year ago
Nice work - this is a real benchmark for cases I use LLMs for