New code-focused LLM needle in the haystack benchmark

6 points by sumanyusharma about 1 year ago

1 comment

sumanyusharma about 1 year ago
Hi HN - In collaboration with UWaterloo, we published a new code-focused needle-in-the-haystack benchmark.

TLDR:

- GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.
- The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
- Llama3-70B reached GPT-3.5-Turbo levels (yay open source!)
- Gemini 1.0 Pro was bad across the board (super surprising).

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
See full results here: https://hamming.ai/blog/bug-in-the-codestack
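
To make the setup concrete, here is a minimal sketch of how a sample for this kind of code-focused needle-in-the-haystack benchmark could be assembled: filler snippets form the haystack, a single syntactic bug is planted at a chosen relative depth, and the model is asked to locate it. The function names, bug type, and prompt below are illustrative assumptions, not the actual BICS harness from the repo.

    import random

    def make_haystack(snippets, context_tokens):
        """Concatenate filler code snippets until a rough token budget is reached."""
        haystack, used = [], 0
        while used < context_tokens:
            s = random.choice(snippets)
            haystack.append(s)
            used += len(s.split())  # crude token proxy, good enough for a sketch
        return haystack

    def insert_needle(haystack, depth_fraction):
        """Plant a single syntactic bug (the 'needle') at a relative depth in the haystack."""
        buggy_line = "def compute_totals(items)  # <- missing colon: the planted bug"
        idx = int(len(haystack) * depth_fraction)
        haystack.insert(idx, buggy_line)
        return "\n\n".join(haystack)

    filler = [
        "def add(a, b):\n    return a + b",
        "def mul(a, b):\n    return a * b",
        "def greet(name):\n    return f'hello {name}'",
    ]
    code = insert_needle(make_haystack(filler, context_tokens=500), depth_fraction=0.5)
    prompt = (
        "The following Python code contains exactly one syntax error. "
        "Name the function it appears in.\n\n" + code
    )
    # The model's answer is scored by whether it identifies the planted function,
    # and the sweep repeats this over different context lengths and target depths.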