There is a square with an arrow, like the Volvo logo, that suggests things are clickable (to get more information?). I tried tapping the Turing test one in a few different places but nothing happened - I'm on iPhone with Safari.
For my year end I collected data on how quickly AI benchmarks are becoming obsolete (https://r0bk.github.io/killedbyllm/). Some interesting findings:

2023: GPT-4 was truly something new
- It didn't just beat SOTA scores, it completely saturated several benchmarks
- First time humanity created something that can beat the Turing test
- Created a clear "before/after" divide

2024: Others caught up, with progress in fits and spurts
- o1/o3 used test-time compute to saturate math and reasoning benchmarks
- Sonnet 3.5/4o pushed several existing benchmarks, and newer visual evals, into saturation
- Llama 3/Qwen 2.5 made open-weight models competitive across the board

And yet with all these saturated benchmarks, I personally still can't trust a model to do the same work as a junior - our benchmarks aren't yet measuring real-world reliability.

Data & sources (if you'd like to contribute): https://github.com/R0bk/killedbyllm
Interactive timeline: https://r0bk.github.io/killedbyllm/

P.S. I've had a hard time deciding which benchmarks are significant enough to include. If you know of other benchmarks (including those yet to be saturated) that help answer "can AI do X" questions, please let me know.
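Roughly, the kind of information that's useful for a suggested entry looks like the sketch below (a simplified illustration, not the exact schema the repo uses; the field names and example values here are made up):

    // Hypothetical shape for a benchmark entry - field names are illustrative,
    // not the actual data format in the killedbyllm repo.
    interface BenchmarkEntry {
      name: string;           // e.g. the benchmark's common name
      domain: string;         // what capability it measures, e.g. "code generation"
      released: number;       // year the benchmark was published
      saturated: boolean;     // has a model effectively maxed it out?
      saturatedBy?: string;   // model generally credited with saturating it
      saturatedYear?: number; // year saturation happened
      notes?: string;         // context, caveats, links to reported scores
    }

    // Illustrative values only - not a claim from the dataset.
    const example: BenchmarkEntry = {
      name: "ExampleBench",
      domain: "code generation",
      released: 2021,
      saturated: true,
      saturatedBy: "GPT-4",
      saturatedYear: 2023,
      notes: "Top models score within a point or two of the ceiling.",
    };

    console.log(`${example.name}: saturated=${example.saturated}`);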