Getting 50% (SoTA) on Arc-AGI with GPT-4o

394 points, by tomduncalf, 11 months ago

26 comments

mikeknoop, 11 months ago
(ARC Prize co-founder here.)

Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

Roughly, he's implemented an outer loop and is using 4o to sample reasoning traces/programs from the training data and test. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.

A couple of important notes:

1. This result is on the public eval set, not the private set (ARC Prize $).

2. The current private-set SOTA ~35% solution also performed ~50% on the public set, so this new result *might* be SOTA but hasn't been validated or scrutinized yet.

All said, I do expect verified public-set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open-source reproductions here once available: https://arcprize.org/leaderboard

EDIT: Also, congrats and kudos to Ryan for achieving this and putting in the effort to document and share his approach. We hope to inspire more frontier AI research sharing like this.
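To make the quoted idea concrete, here is a minimal sketch of that sample-and-select loop. `llm_sample` and `make_prompt` are hypothetical placeholders for the model call and prompt construction; this is not Ryan's actual code.

```python
# Minimal sketch of the sample-and-select loop described above.
# llm_sample(prompt) and make_prompt(examples) are hypothetical placeholders,
# not a real API; this is not the article's actual implementation.
def fits_all_examples(src, examples):
    """Return True if the candidate program reproduces every train output."""
    namespace = {}
    try:
        exec(src, namespace)                  # candidate defines transform(grid) -> grid
        transform = namespace["transform"]
        return all(transform(ex["input"]) == ex["output"] for ex in examples)
    except Exception:
        return False                          # crashing or malformed candidates are dropped

def solve_task(task, llm_sample, make_prompt, n_samples=8000):
    """Sample many candidate programs; submit outputs of one that fits the train pairs."""
    prompt = make_prompt(task["train"])
    for _ in range(n_samples):
        src = llm_sample(prompt)
        if fits_all_examples(src, task["train"]):
            namespace = {}
            exec(src, namespace)
            return [namespace["transform"](t["input"]) for t in task["test"]]
    return None                               # no sampled program matched all examples
```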
whiplash451, 11 months ago
The article jumps to the conclusion "Given that current LLMs can perform decently well on ARC-AGI" after using multiple hand-crafted tricks to get to these results, including "I also did a small amount of iteration on a 100 problem subset of the public test set", which is buried in the middle of the article and not mentioned in the bullet list at the top.

Add to that the close-to-ad-hominem attack on Francois Chollet with the comic at the beginning (Francois never claimed to be a neuro-symbolic believer), and this work does a significant disservice to the community.
extr, 11 months ago
Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations of the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then failing to grasp some critical bit of logic and missing the solution entirely.

At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute-forcing Python solutions is a very "non-human" approach.

[0] https://x.com/eatpraydiehard/status/1632671307254099968

[1] https://x.com/eatpraydiehard/status/1632683214329479169
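For illustration, a naive JSON-style prompt along these lines might look like the sketch below. The format and wording are assumptions, not necessarily what was used in the linked experiments.

```python
# Illustrative only: a naive JSON prompt representation of an ARC task,
# in the spirit of the approach described above (not the exact format used).
import json

task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 3], [0, 0]]}],
}

prompt = (
    "Each grid is a 2-D array of color indices 0-9. "
    "Infer the transformation from the train pairs, then apply it to the test input.\n"
    + json.dumps(task, indent=2)
)
print(prompt)
```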
atleastoptimal, 11 months ago
I'll say what a lot of people seem to be denying. GPT-4 is an AGI, just a very bad one. Even GPT-1 was an AGI. There isn't a hard boundary between non-AGI and AGI. A lot of people wish there were, so they imagine absolutes regarding LLMs like "they cannot create anything new" or something like that. Just think: we consider humans a general intelligence, but obviously wouldn't consider an embryo or infant a general intelligence. So at what point does a human go from not generally intelligent to generally intelligent? And I don't mean an age or brain size, I mean a suite of testable abilities.

Intelligence is an ability that is naturally gradual and emerges over many domains. It is a collection of tools via which general abstractive principles can be applied, not a singular universally applicable ability to think in abstractions. GPT-4, compared to a human, is a very, very small brain trained for the single purpose of textual thinking, with some image capabilities. Claiming that ARC is the absolute marker of general intelligence fails to account for the big picture of what intelligence is.
asperous, 11 months ago
Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out: it's more of a pattern-matching test.

On the other hand, figuring out which manipulations are effective does teach us something. And I think most problems boil down to pattern matching, so creating a true, easily testable AGI test may be tough.
Imnimo, 11 months ago
To me the big take-aways here are:

1) Most of the heavy lifting is being done by search. We're talking about having the LLM generate thousands of candidate solutions, and they're mostly bad enough that "just pick the ones that get kinda close on the examples" is a meaningful operation (a rough sketch of this scoring step follows below).

2) More samples improve performance despite the fact that GPT-4o's vision is not capable of parsing the inputs. I'm curious how much performance would degrade if you shuffled the images passed to the model (but used the correct images when evaluating which candidates to keep).

3) It's definitely true that the LLM has to be giving you something more than random programs. At the very least, the LLM knows how to craft parsimonious programs that are more likely to be the solution. It may be that it's providing more than that, but it's not clear to me exactly how much information about the correct search space is coming from the hand-crafted examples in the prompt.

Overall, the work to get this far is very impressive, but it doesn't really move the needle for me on whether GPT-4 can do ARC puzzles. It does, however, show me that search is surprisingly powerful on this task.
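As a rough illustration of point 1, the "get kinda close" filter can be as simple as scoring each candidate by the fraction of example output cells it reproduces. This is a sketch under my own assumptions; `run_candidate` is a hypothetical executor, not from the article.

```python
# Sketch of ranking candidate programs by closeness to the train outputs.
# run_candidate(candidate, grid) is a hypothetical executor returning a grid or None.
def closeness(predicted, expected):
    """Fraction of cells matched; 0.0 if the prediction is missing or misshapen."""
    if predicted is None or len(predicted) != len(expected) or \
            any(len(pr) != len(er) for pr, er in zip(predicted, expected)):
        return 0.0
    total = sum(len(row) for row in expected)
    hits = sum(p == e for pr, er in zip(predicted, expected) for p, e in zip(pr, er))
    return hits / total

def rank_candidates(candidates, examples, run_candidate):
    """Sort candidates by mean closeness across the train examples, best first."""
    def score(cand):
        return sum(closeness(run_candidate(cand, ex["input"]), ex["output"])
                   for ex in examples) / len(examples)
    return sorted(candidates, key=score, reverse=True)
```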
bearjaws, 11 months ago
It seems that ARC-AGI is flawed, rather than GPT-4o being closer to AGI.

Maybe we need an AI version of Hanlon's razor: never attribute to AGI what could easily be explained by being in the training set.
badrunaway, 11 months ago
When we talk about System 2: is it possible that [generating a large number of programs; evaluating them on the task; choosing the top-K outcomes; feeding them back to the neural net] can act as System 2 for an AGI? Isn't that how we think intelligently as well: by making lots of hypotheses internally, evaluating them, and updating our model?
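Purely as an illustration of the loop being proposed, a sketch might look like this. `llm_sample`, `llm_revise`, and `score` are hypothetical helpers, not an existing API.

```python
# Illustrative sketch of a generate -> evaluate -> top-K -> revise loop.
# llm_sample, llm_revise and score are hypothetical helpers, not a real API.
def system2_loop(task, llm_sample, llm_revise, score, rounds=3, n=256, k=8):
    pool = [llm_sample(task) for _ in range(n)]             # broad initial hypotheses
    for _ in range(rounds):
        pool.sort(key=lambda cand: score(cand, task["train"]), reverse=True)
        best, top_k = pool[0], pool[:k]
        if score(best, task["train"]) == 1.0:               # a hypothesis fits all examples
            return best
        # feed the best attempts back to the model for revision
        pool = [llm_revise(task, cand) for cand in top_k for _ in range(n // k)]
    return pool[0]
```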
traject_, 11 months ago
We don't actually know if it is SOTA; the previous SOTA solution also got around the same on the evaluation set.
YeGoblynQueenne, 11 months ago
>> Claim 1 seems likely true to me for a reasonable notion of "learning". I think François Chollet agrees here. Most of my doubts about this claim are concerns that you can basically brute force ARC-AGI without interestingly doing learning (e.g. brute-force search over some sort of DSL or training on a huge array of very similar problems). These concerns apply much less to the kind of approach I used

The approach described in the article is exactly "brute-force search over some sort of DSL". The "DSL" is a model of Python syntax that GPT-4o has learned after training on the entire internet. This "DSL" is locked up in the black box of GPT-4o's weights, but just because no one can see it doesn't mean it's not there; and we can see GPT-4o generating Python programs, so we know it is there, even if we don't know what it looks like.

That DSL may not be "domain specific" in the sense of being specifically tailored to solve ARC-AGI tasks, or any other particular task, but it is "domain specific" in the sense of generating some subset of all possible Python programs that includes programs which can solve some ARC-AGI tasks. That's a very broad category, but that's why it over-generates so much: it needs to draw 8k samples until one works, and that for just 50% of the public eval set.
rgbrgb, 11 months ago
> 50% accuracy on the public test set for ARC-AGI by having GPT-4o

Isn't the public test set public on GitHub, and therefore something GPT-4o was trained on?
eigenvalue, 11 months ago
The Arc stuff just felt intuitively wrong as soon as I heard it. I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point or something, to push back against all the unbridled optimism. The problem is, the optimism really seems to be justified, and the rate of improvement of LLMs in the past 12 months has been nothing short of astonishing.

So it's not at all surprising to me to see Arc already being mostly solved using existing models, just with different prompting techniques and some tool usage. At some point, the naysayers about LLMs are going to have to confront the problem that, if they are right about LLMs not really thinking/understanding/being sentient, then a very large percentage of people living today are also not thinking/understanding/sentient!
sparsely, 11 months ago
You can have a go at the problems here: https://arcprize.org/play?task=00576224

None of them are terribly hard, but some aren't trivial either; a couple took me a bit of thinking to work out. By far the most tedious part is inputting the result (I didn't bother after the first), which is definitely something AI is better at!
machiaweliczny, 11 months ago
This challenge looks quite solvable, but it relies on physics understanding and has a lot of human/world priors in the sense of spatial understanding and object boundaries.

It seems to rely on identification of objects and then mapping them somehow. Most of the cases I've seen so far are based on some transformation or relation between the objects.

So far it seems like a search among common transformations and relations could solve it, plus some heuristics/computation for counting, order, wholeness (boundary), or pattern.

IMO it can most likely be solved by a search over programs that combine these, with some LLM to guide the heuristics.

The only hard ones were the one with applied noise and the one testing understanding of "gravity".

Did anyone test a human baseline for this?
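A toy version of that "search among common transformations" idea, restricted to whole-grid transforms, is sketched below. The transform library and names are my own illustration; real tasks need compositions, object segmentation, and the heuristics mentioned above.

```python
# Toy version of "search among common transformations": try a small library of
# whole-grid transforms and keep one consistent with every train pair.
import numpy as np

TRANSFORMS = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g),
    "rot180":    lambda g: np.rot90(g, 2),
    "flip_h":    lambda g: np.fliplr(g),
    "flip_v":    lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def search_transform(train_pairs):
    """Return the name of a transform that explains all train pairs, or None."""
    for name, fn in TRANSFORMS.items():
        if all(np.array_equal(fn(np.array(p["input"])), np.array(p["output"]))
               for p in train_pairs):
            return name
    return None  # real ARC tasks need compositions, objects, counting, ...
```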
trott, 11 months ago
François Chollet says LLMs do not learn in-context. But Geoff Hinton says LLMs' few-shot learning compares quite favorably with people!

https://www.youtube.com/watch?v=QWWgr2rN45o&t=46m20s

The truth is in the middle, I think. They learn in-context, but not as well as humans.

The approach in the article hides the unreliability of current LLMs by generating thousands of programs, and still the results aren't human-level. (This is impressive work though -- I'm not criticizing it.)
killerstorm, 11 months ago
FWIW, GPT-4 is able to generate a plan very similar to the one in the article: it also involves feature extraction, program synthesis, and iterative refinement.

https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac909e

So it's pretty close to being able to plan the solution completely on its own. It's just rather bad at coding and visual inputs, so it doesn't know what it doesn't know.
TheDudeMan, 11 months ago
"Vision is an especially large weakness."

But you can have GPT write code to reliably convert the image grid into a textual representation, right? And code to convert back to an image and auto-verify.
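A minimal round-trip sketch of that suggestion, as my own illustration (the exact text format is an assumption):

```python
# A minimal sketch of the suggestion above: serialize the grid to text and
# verify the round trip, so the model never has to rely on its weak vision.
def grid_to_text(grid):
    """One row per line, cells as space-separated color indices."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def text_to_grid(text):
    """Parse the textual form back into a list-of-lists grid."""
    return [[int(tok) for tok in line.split()] for line in text.strip().splitlines()]

grid = [[0, 3, 3], [0, 0, 3]]
assert text_to_grid(grid_to_text(grid)) == grid  # auto-verify the round trip
```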
nadam, 11 months ago
Amazing work; prompt engineering at its finest. One future direction for ARC-AGI could be to use not Python but a much more concise programming language that is better suited to brute-force methods like genetic mutations. The problem, of course, would be training an LLM that is proficient enough in such a language. I am thinking about stack-based languages. For this competition I would develop a careful bit-level encoding of a variant of the 'Joy' programming language (https://en.wikipedia.org/wiki/Joy_(programming_language)). It would be a considerable effort, though, which I don't have time for; hence I post this idea publicly. A promising direction is a mix of things, in my opinion: a special stack-based concise language, consulting LLMs like the OP did, and genetic algorithms combined.
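To make the idea concrete, a tiny stack-based grid language might look something like the sketch below. This is my own illustration, not the proposed Joy encoding; the op set and names are assumptions.

```python
# Tiny sketch of a concise stack-based grid language, in the spirit of the idea
# above (not the proposed Joy encoding). Flat lists of ops are easy to mutate
# in a genetic search.
import random
import numpy as np

OPS = {
    "rot90":     lambda s: s.append(np.rot90(s.pop())),
    "flip_h":    lambda s: s.append(np.fliplr(s.pop())),
    "flip_v":    lambda s: s.append(np.flipud(s.pop())),
    "transpose": lambda s: s.append(s.pop().T),
    "dup":       lambda s: s.append(s[-1].copy()),
    "tile2x2":   lambda s: s.append(np.tile(s.pop(), (2, 2))),
}

def run(program, grid):
    """Run a postfix program (a list of op names) on an input grid."""
    stack = [np.array(grid)]
    for op in program:
        OPS[op](stack)
    return stack[-1]

def mutate(program):
    """Point mutation: replace one op with a random one."""
    i = random.randrange(len(program))
    return program[:i] + [random.choice(list(OPS))] + program[i + 1:]

print(run(["flip_h", "transpose"], [[1, 2], [3, 4]]))  # -> [[2 4] [1 3]]
```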
htrp, 11 months ago
The expectation is that you'll have to have dynamically generated benchmarks with better evals at some point, given the potential for brute-forcing the private validation set.
greatpostman, 11 months ago
You know you’re approaching AGI when creating benchmarks gets difficult. This is only just beginning
bashfulpup, 11 months ago
I looked at the website and have no idea how Arc is supposed to be AGI.

Can someone explain?
uptownfunk, 11 months ago
ARC-AGI is a small stepping stone to AGI, but it is not AGI.

Program search mimics what humans do to a certain extent, but not in its entirety.

A more general world model and reference will be required for AGI.
bjornsing, 11 months ago
Can we be sure GPT-4o hasn’t been trained on the public test set?
gibsonf1, 11 months ago
Isn't 50% kind of a failing grade?
cchance, 11 months ago
LOL, I looked at that first complex test sample and closed the page; it made my brain hurt.
comfortabledoug, 11 months ago
I'm glad someone else finally said it: those born blind cannot possibly have AGI!

/sarcasm :D