New code-focused LLM needle in the haystack benchmark

6 points by sumanyusharma about 1 year ago

1 comment

sumanyusharma about 1 year ago
Hi HN - In collaboration with UWaterloo, we published a new code-focused needle-in-the-haystack benchmark.

TLDR:

- GPT-3.5-Turbo showed lower accuracy on the BICS benchmark than on the BABILONG benchmark at the same context length and target depth, indicating that LLMs struggle more with code-based tasks than text-based tasks at long context lengths.
- The hype is real. GPT-4o showed the best performance, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models.
- Llama3-70B reached GPT-3.5-Turbo levels (yay open source!)
- Gemini 1.0 Pro was bad across the board (super surprising).

Link to repo: https://github.com/HammingHQ/bug-in-the-code-stack
See full results here: https://hamming.ai/blog/bug-in-the-codestack
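
To make the setup concrete, here is a minimal sketch of how a sample for this kind of code-focused needle-in-the-haystack benchmark could be assembled: filler snippets form the haystack, a single syntactic bug is planted at a chosen relative depth, and the model is asked to locate it. The function names, bug type, and prompt below are illustrative assumptions, not the actual BICS harness from the repo.

    import random

    def make_haystack(snippets, context_tokens):
        """Concatenate filler code snippets until a rough token budget is reached."""
        haystack, used = [], 0
        while used < context_tokens:
            s = random.choice(snippets)
            haystack.append(s)
            used += len(s.split())  # crude token proxy, good enough for a sketch
        return haystack

    def insert_needle(haystack, depth_fraction):
        """Plant a single syntactic bug (the 'needle') at a relative depth in the haystack."""
        buggy_line = "def compute_totals(items)  # <- missing colon: the planted bug"
        idx = int(len(haystack) * depth_fraction)
        haystack.insert(idx, buggy_line)
        return "\n\n".join(haystack)

    filler = [
        "def add(a, b):\n    return a + b",
        "def mul(a, b):\n    return a * b",
        "def greet(name):\n    return f'hello {name}'",
    ]
    code = insert_needle(make_haystack(filler, context_tokens=500), depth_fraction=0.5)
    prompt = (
        "The following Python code contains exactly one syntax error. "
        "Name the function it appears in.\n\n" + code
    )
    # The model's answer is scored by whether it identifies the planted function,
    # and the sweep repeats this over different context lengths and target depths.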