> Over the last two years, we’ve more or less run out of benchmarks where AI isn’t better than humans.

This whole "benchmarks" thing is laughable. I've been using Gemini all week for code assist, patch review, etc. It produces impressive-looking text, bulleted lists, suggestions, and so on, but at the same time it makes tons of mistakes. When you call it on one, it predictably replies "oh sorry! of course!" Yes, of COURSE, because all it does is guess which word is most likely to come after the previous ones. Is there a benchmark for "doesn't hallucinate made-up BS"? Because humans can do very well on that one.
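
If you've never seen what that loop actually looks like, here's a minimal sketch of it (assuming Hugging Face's transformers library with GPT-2 as a stand-in and greedy decoding for simplicity, not whatever Gemini actually runs):

    # Minimal autoregressive next-token loop: repeatedly pick the
    # single most likely next token and append it to the prompt.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("The capital of France is", return_tensors="pt").input_ids
    for _ in range(5):
        logits = model(ids).logits        # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()  # greedy: most probable next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    print(tok.decode(ids[0]))

Notice there's no "is this actually true?" step anywhere in there; the model just keeps emitting whatever scores highest, which is exactly why it can apologize so fluently right after being wrong so fluently.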