Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases. It's similar to a text-based needle-in-the-haystack test, except the needle is a syntactic bug hidden in a stack of Python source code.

GPT-3.5-Turbo showed lower accuracy on BICS than on the BABILong benchmark at the same context length and target depth, suggesting that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.

GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series held up especially well at long context lengths compared to other models. Gemini-1.0-Pro performed the worst, surprisingly worse than Llama3-70B.

In general, longer context lengths resulted in lower accuracy, though there were a few exceptions.
Models also react differently to where the bug is placed within the source code. GPT-3.5-Turbo and Claude 3 Opus were the most sensitive to bug placement, while the GPT-4 series was the least sensitive. Generally, lower sensitivity to placement indicates a more robust model.

This benchmark still has plenty of limitations. I'd love your feedback and suggestions on how we can make it more useful!

Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack
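
In case it helps to picture the setup, here's a simplified sketch of how a single BICS-style sample could be constructed: a stack of filler functions with one syntactically buggy function inserted at a target depth, plus a prompt asking the model to name it. The function names, bug type, and prompt wording below are just for illustration, not the actual benchmark code; the repo above has the real implementation.

    # Simplified sketch of how a BICS-style sample might be built.
    # Names, bug type, and prompt wording are illustrative assumptions,
    # not the actual benchmark code (see the repo above for that).

    FILLER_TEMPLATE = """\
    def filler_{i}(x):
        # Placeholder function used only to pad the context.
        return x + {i}
    """

    BUGGY_SNIPPET = """\
    def compute_total(values):
        total = 0
        for v in values   # syntactic bug: missing colon
            total += v
        return total
    """

    def build_sample(num_fillers: int, target_depth: float):
        """Build a long Python 'haystack' with one syntactically buggy
        function inserted at roughly target_depth (0.0 = top, 1.0 = bottom)."""
        fillers = [FILLER_TEMPLATE.format(i=i) for i in range(num_fillers)]
        insert_at = int(target_depth * num_fillers)
        code = "\n".join(fillers[:insert_at] + [BUGGY_SNIPPET] + fillers[insert_at:])
        prompt = (
            "The following Python source contains exactly one syntax error.\n"
            "Reply with the name of the function that contains it.\n\n" + code
        )
        # The expected answer lets us score a model's reply for accuracy.
        return prompt, "compute_total"

    if __name__ == "__main__":
        prompt, answer = build_sample(num_fillers=200, target_depth=0.5)
        print(f"Prompt length: {len(prompt)} chars, expected answer: {answer}")

Varying num_fillers changes the context length, and varying target_depth moves the bug's position, which is how accuracy can be measured across both dimensions.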