I'm less focused on the particular results here, and more on the fact that this is where we're at: measuring ML at ML. Imagine a future where we can't construct a benchmark that demonstrates humans outperform machines *at* machine learning. Sure, that doesn't mean machines are actually better in the key respects, especially creativity and induction. But that's still a hell of a stage to be at.

> consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts.

> the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.

> However, humans currently display better returns to increasing time budgets