Show HN: Killed by LLM – I catalogued AI benchmarks we thought would last years

17 points by robkop 5 months ago

3 comments

a1o 5 months ago
There is a square with an arrow, like the Volvo logo, that suggests things are clickable (to get more information?). I tried to tap the Turing test one in different places but nothing happened. I am on iPhone with Safari.
robkop 5 months ago
For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts
- O1/O3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o incremented some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3/Qwen 2.5 brought open-weight models to be competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior. Our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive timeline: https://r0bk.github.io/killedbyllm/

P.S. I've had a hard time deciding which benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions, then please let me know.
detente18 5 months ago
It's interesting to see how short-lived some of these were (e.g. HumanEval).