Author here -- six months ago we launched ARC Prize, a huge $1M experiment to test whether we need new ideas for AGI. The ARC-AGI benchmark remains unbeaten, and I think we can now definitively say "yes".<p>One big update since June is that progress is no longer stalled. Coming into 2024, the public consensus vibe was that pure deep learning / LLMs would continue scaling to AGI. The fundamental architecture of these systems hasn't changed since ~2019.<p>But this flipped in late summer. AlphaProof and o1 are evidence of this new reality. All frontier AI systems are now incorporating components beyond pure deep learning, like program synthesis and program search.<p>I believe ARC Prize played a role here too. All the winners this year are leveraging new AGI reasoning approaches like deep-learning-guided program synthesis and test-time training/fine-tuning. We'll be seeing a lot more of these in frontier AI systems in coming years.<p>And I'm proud to say that all the code and papers from this year's winners are now open source!<p>We're going to keep running this thing annually until it's defeated. And we've got ARC-AGI-2 in the works to improve on several of the v1 flaws (more here: <a href="https://arcprize.org/blog/arc-prize-2024-winners-technical-report" rel="nofollow">https://arcprize.org/blog/arc-prize-2024-winners-technical-r...</a>)<p>The ARC-AGI community keeps surprising me -- from initial launch, through o1 testing, to the final 48 hours when the winning team jumped 10% and both winning papers dropped out of nowhere. I'm incredibly grateful to everyone and we will do our best to steward this attention towards AGI.<p>We'll be back in 2025!
What surprises me about this is how poorly general-purpose LLMs do. The best one is OpenAI o1-preview at 18%, significantly worse than purpose-built models like ARChitects (which scored 53.5%). That model used TTT to train on the ARC-AGI task specification, among other things (a sketch of the general idea is below). It seems that even if someone creates a model that can "solve" ARC, that still is not indicative of AGI, since it is no longer "general": it is specialized to this particular task, similar to how chess engines are not AGI despite being superhuman at chess. It will be much more convincing when general models not trained specifically for ARC can still score well on it.<p>They do mention that some of the tasks here are susceptible to brute force and that they plan to address this in ARC-AGI-2.<p>> nearly half (49%) of the private evaluation set was solved by at least one team during the original 2020 Kaggle competition all of which were using some variant of brute-force program search. This suggests a large fraction of ARC-AGI-1 tasks are susceptible to this kind of method and does not carry much useful signal towards general intelligence.
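To make the TTT point concrete, here is a minimal sketch of what fine-tuning a small causal LM on a single task's own demonstration pairs at test time could look like. This is only an illustration under my own assumptions (gpt2 as a stand-in base model, made-up serialize_grid/task_to_text helpers), not the ARChitects pipeline:<p><pre><code> import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in base model; the actual submissions use their own checkpoints.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def serialize_grid(grid):
    # Render each grid row as a string of color digits 0-9.
    return "\n".join("".join(str(c) for c in row) for row in grid)

def task_to_text(task):
    # Concatenate the task's demonstration pairs into one training string.
    parts = []
    for pair in task["train"]:
        parts.append("input:\n" + serialize_grid(pair["input"]))
        parts.append("output:\n" + serialize_grid(pair["output"]))
    return "\n".join(parts)

def test_time_train(model, task, steps=32, lr=1e-5):
    # The core of TTT: a few gradient steps on this one task's own
    # demonstrations before predicting its held-out test output.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    batch = tok(task_to_text(task), return_tensors="pt")
    model.train()
    for _ in range(steps):
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model
</code></pre>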
The first question I still have is what happened to core knowledge priors. The white paper that introduced ARC made a big to-do about how core knowledge priors are necessary to solve ARC tasks, but from what I can tell none of the best-performing (or at-all-performing) systems have anything to do with core knowledge priors.<p>So what happened to that assumption? Is it dead?<p>The second question I still have is about the defenses of ARC against memorisation-based, big-data approaches. I note that the second-best system is based on an LLM with "test time training" where the first two steps are:<p><pre><code> initial finetuning on similar tasks
auxiliary task format and augmentations
</code></pre>
Which is to say, a data augmentation approach. With big data comes great responsibility, and the authors of the second-best system don't disappoint: they claim that by training on more examples they achieve reasoning.<p>So what happened to the claim that ARC is secure against big-data approaches? Is it dead?
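For what it's worth, the augmentations in question are mostly grid symmetries plus color relabelings, which multiply each hand-authored example into many. A toy sketch of the idea (my own hypothetical augment_pair helper, not code from the paper):<p><pre><code> import random
import numpy as np

def augment_pair(inp, out):
    # Apply one random symmetry plus a color relabeling to an ARC example.
    inp, out = np.array(inp), np.array(out)
    k = random.randrange(4)              # rotate by 0/90/180/270 degrees
    inp, out = np.rot90(inp, k), np.rot90(out, k)
    if random.random() < 0.5:            # optional horizontal flip
        inp, out = np.fliplr(inp), np.fliplr(out)
    perm = np.random.permutation(10)     # relabel the 10 ARC colors
    return perm[inp].tolist(), perm[out].tolist()

# One example becomes a hundred "new" training examples:
pair = ([[1, 0], [0, 2]], [[2, 0], [0, 1]])
augmented = [augment_pair(*pair) for _ in range(100)]
</code></pre>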
I'm unable to figure out how to solve the current Daily Puzzle (Puzzle ID: 79369cc6) at <a href="https://arcprize.org/play" rel="nofollow">https://arcprize.org/play</a><p>Either I'm really dumb, or the test is getting into captcha-like territory where humans aren't really good at solving/deciphering it anymore.
Were there any interesting non-neural approaches? I was wondering whether there is any underlying structure in the ARC tasks that could tell us something about algorithms for "reasoning" problems in general.
Reasons I can't take this benchmark seriously:<p>1. Existing brute-force algorithms solve 40% of this "reasoning" and "generalization" test.<p>2. AGI must evidently fit on a single 16GB, decade-old GPU?<p>3. If ARC fails blind people, it's not a reasoning test. Reasoning is independent of visual acuity. So ARC is at best a vision processing <i>then</i> reasoning test. SotA model "failure" is meaningless. ("But what about the other format, JSON?" Yeah, I would <i>love</i> to see the human solve rate on that...)
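To be concrete about point 1: "brute force" here means enumerating compositions of simple grid transformations until one reproduces every demonstration pair. A toy version (my own illustration, not any 2020 team's actual solver):<p><pre><code> import itertools
import numpy as np

PRIMITIVES = {
    "rot90": np.rot90,
    "flip_h": np.fliplr,
    "flip_v": np.flipud,
    "transpose": np.transpose,
}

def search(train_pairs, max_depth=3):
    # Enumerate every composition of primitives up to max_depth and return
    # the first one consistent with all demonstration pairs.
    for depth in range(1, max_depth + 1):
        for names in itertools.product(PRIMITIVES, repeat=depth):
            def run(g):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(np.array_equal(run(np.array(i)), np.array(o))
                   for i, o in train_pairs):
                return names
    return None

# Hidden rule: rotate 90 degrees counter-clockwise.
pairs = [([[1, 2], [3, 4]], [[2, 4], [1, 3]])]
print(search(pairs))  # ('rot90',)
</code></pre>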
I'm a little surprised by the seeming enthusiasm in the report for TTT as an approach. The results speak for themselves, and TTT is clearly powerful. But its dependence on large amounts of synthetic pre-training data seems to contradict the philosophical ideas behind the competition.