Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases. It's similar to a text-based needle-in-the-haystack test, except the needle is a syntactic bug hidden in a stack of Python source code.

GPT-3.5-Turbo showed lower accuracy on BICS than on the BABILong benchmark at the same context length and target depth, suggesting that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.

GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series held up especially well at long context lengths compared to other models. Gemini-1.0-Pro performed the worst, surprisingly worse than Llama3-70B.

In general, longer context lengths resulted in lower accuracy, though there were a few exceptions.
Models also react differently to where the bug is placed within the source code. GPT-3.5-Turbo and Claude 3 Opus were the most sensitive to bug placement, while the GPT-4 series was the least sensitive. Generally, lower sensitivity to placement indicates a more robust model.

This benchmark still has plenty of limitations. I'd love your feedback and suggestions on how we can make it more useful!

Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack
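
In case it helps to picture the setup, here's a simplified sketch of how a single BICS-style sample could be constructed: a stack of filler functions with one syntactically buggy function inserted at a target depth, plus a prompt asking the model to name it. The function names, bug type, and prompt wording below are just for illustration, not the actual benchmark code; the repo above has the real implementation.

    # Simplified sketch of how a BICS-style sample might be built.
    # Names, bug type, and prompt wording are illustrative assumptions,
    # not the actual benchmark code (see the repo above for that).

    FILLER_TEMPLATE = """\
    def filler_{i}(x):
        # Placeholder function used only to pad the context.
        return x + {i}
    """

    BUGGY_SNIPPET = """\
    def compute_total(values):
        total = 0
        for v in values   # syntactic bug: missing colon
            total += v
        return total
    """

    def build_sample(num_fillers: int, target_depth: float):
        """Build a long Python 'haystack' with one syntactically buggy
        function inserted at roughly target_depth (0.0 = top, 1.0 = bottom)."""
        fillers = [FILLER_TEMPLATE.format(i=i) for i in range(num_fillers)]
        insert_at = int(target_depth * num_fillers)
        code = "\n".join(fillers[:insert_at] + [BUGGY_SNIPPET] + fillers[insert_at:])
        prompt = (
            "The following Python source contains exactly one syntax error.\n"
            "Reply with the name of the function that contains it.\n\n" + code
        )
        # The expected answer lets us score a model's reply for accuracy.
        return prompt, "compute_total"

    if __name__ == "__main__":
        prompt, answer = build_sample(num_fillers=200, target_depth=0.5)
        print(f"Prompt length: {len(prompt)} chars, expected answer: {answer}")

Varying num_fillers changes the context length, and varying target_depth moves the bug's position, which is how accuracy can be measured across both dimensions.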