科技回声

3 comments

a1o, 5 months ago

There is a square with an arrow, like the Volvo logo, that suggests things are clickable (to get more information?). I tried to tap the Turing test one in different places but nothing happened - I am on iPhone with Safari.

robkop, 5 months ago

For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, progress in fits and spurts
- O1/O3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o incremented some benchmarks into saturation, and pushed new visual evals into saturation
- Llama 3/Qwen 2.5 brought open-weight models to be competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive timeline: https://r0bk.github.io/killedbyllm/

P.S. I've had a hard time deciding which benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions, then please let me know.

detente18, 5 months ago

It's interesting to see how short-lived some of these were (e.g. HumanEval).

Show HN: Killed by LLM – I catalogued AI benchmarks we thought would last years
