
Show HN: Killed by LLM – I catalogued AI benchmarks we thought would last years

17 points by robkop 5 months ago | 3 comments

a1o 5 months ago
There is a square with an arrow, like the Volvo logo, that suggests things are clickable (to get more information?). I tried to tap the Turing test one in different places but nothing happened - I am on iPhone with Safari.
Comment #42579703 not loaded.
robkop 5 months ago
For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts
- O1/O3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o incremented some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3/Qwen 2.5 brought open-weight models to be competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive timeline: https://r0bk.github.io/killedbyllm/

P.S. I've had a hard time deciding which benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions then please let me know.
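For anyone who wants to poke at the catalogue programmatically, a minimal sketch of loading the repo's data and listing benchmarks by when they were saturated might look like the following. The data file path and field names (status, year_saturated, saturated_by) are assumptions for illustration, not the repo's documented schema - check the killedbyllm repository for the real layout.

    import json
    import urllib.request

    # Hypothetical location and schema; adjust to the actual file in the repo.
    DATA_URL = "https://raw.githubusercontent.com/R0bk/killedbyllm/main/data/benchmarks.json"

    def load_benchmarks(url: str = DATA_URL) -> list[dict]:
        """Fetch the benchmark catalogue as a list of dicts."""
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def saturated_benchmarks(entries: list[dict]) -> list[dict]:
        """Keep entries marked as saturated, ordered by the year they fell."""
        done = [e for e in entries if e.get("status") == "saturated"]
        return sorted(done, key=lambda e: e.get("year_saturated", 0))

    if __name__ == "__main__":
        for b in saturated_benchmarks(load_benchmarks()):
            print(f'{b.get("name")}: saturated {b.get("year_saturated")} by {b.get("saturated_by")}')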
detente18 5 months ago
It's interesting to see how short-lived some of these were (e.g. HumanEval).