
ARC Prize – a $1M+ competition towards open AGI progress

588 points by mikeknoop · 12 months ago
Hey folks! Mike here. Francois Chollet and I are launching ARC Prize, a public competition to beat and open-source the solution to the ARC-AGI eval.

ARC-AGI is (to our knowledge) the only eval which measures AGI: a system that can efficiently acquire new skills and solve novel, open-ended problems. Most AI evals measure skill directly, rather than the acquisition of new skill.

Francois created the eval in 2019. SOTA was 20% at inception; SOTA today is only 34%. Humans score 85-100%. 300 teams attempted ARC-AGI last year, and several bigger labs have attempted it.

While most other skill-based evals have rapidly saturated to human level, ARC-AGI was designed to resist "memorization" techniques (e.g., LLMs).

Solving ARC-AGI tasks is quite easy for humans (even children) but impossible for modern AI. You can try ARC-AGI tasks yourself here: https://arcprize.org/play

ARC-AGI consists of 400 public training tasks, 400 public test tasks, and 100 secret test tasks. Every task is novel. SOTA is measured against the secret test set, which adds to the robustness of the eval.

Solving ARC-AGI tasks requires no world knowledge and no understanding of language. Instead, each puzzle requires a small set of "core knowledge priors" (goal directedness, objectness, symmetry, rotation, etc.)

At minimum, a solution to ARC-AGI opens up a completely new programming paradigm where programs can perfectly and reliably generalize from an arbitrary set of priors. At maximum, it unlocks the tech tree towards AGI.

Our goals with this competition are:

1. Increase the number of researchers working on frontier AGI research (vs. tinkering with LLMs). We need new ideas, and the solution is likely to come from an outsider!

2. Establish a popular, objective measure of AGI progress that the public can use to understand how close we are to AGI (or not). Every new SOTA score will be published here: https://x.com/arcprize

3. Beat ARC-AGI and learn something new about the nature of intelligence.

Happy to answer questions!
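For concreteness, here is a minimal sketch (in Python, with a hypothetical placeholder solve() function; nothing here is official contest tooling) of how a candidate can be checked against a task file in the public JSON format of train/test pairs of integer grids:

    import json

    def solve(grid):
        # Hypothetical candidate solver: grid in, grid out.
        # The identity transform stands in for a real program.
        return grid

    def check_task(path):
        with open(path) as f:
            task = json.load(f)
        # A useful candidate must reproduce every training output exactly,
        # and is ultimately judged on the held-out test pair(s).
        train_ok = all(solve(p["input"]) == p["output"] for p in task["train"])
        test_ok = all(solve(p["input"]) == p["output"] for p in task["test"])
        return train_ok, test_ok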

64 comments

neoneye2 · 12 months ago
I'm Simon Strandgaard and I participated in ARCathon 2022 (solved 3 tasks) and ARCathon 2023 (solved 8 tasks).

I'm collecting data on how humans solve ARC tasks, and so far have collected 4100 interaction histories (https://github.com/neoneye/ARC-Interactive-History-Dataset). Besides ARC-AGI, there are other ARC-like datasets; these can be tried in my editor (https://neoneye.github.io/arc/).

I have made some videos about ARC:

Replaying the interaction histories, where you can see people have different approaches. It's 100ms per interaction; in real life people don't solve tasks that fast. https://www.youtube.com/watch?v=vQt7UZsYooQ

When I'm manually solving an ARC task, it looks like this, and you can see I'm rather slow. https://www.youtube.com/watch?v=PRdFLRpC6dk

What is weird: the way I implement a solver for a specific ARC task is very different from the way I would manually solve the puzzle, since the solver has to handle all kinds of edge cases.

Huge thanks to the team behind the ARC Prize. Well done.
salamo · 12 months ago
This is super cool. I share Francois' intuition that the presently data-hungry learning paradigm is not only not generalizable but unsustainable: humans do not need 10,000 examples to tell the difference between cats and dogs, and the main reason computers can today is because we have millions of examples. As a result, it may be hard to transfer knowledge to more esoteric domains where data is expensive, rare, and hard to synthesize.

If I can make one criticism/observation of the tests, it seems that most of them reason about perfect information in a game-theoretic sense. However, many if not most of the more challenging problems we encounter involve hidden information. Poker and negotiations are examples of problem solving in imperfect-information scenarios. Smoothly navigating social situations likewise requires working with hidden information.

One of the really interesting things we humans are able to do is take the rules of a game and generate strategies. While we do have some algorithms which can "teach themselves" to play e.g. Go or chess, those same self-play algorithms don't work on hidden-information games. One of the really interesting capabilities of any generally intelligent system would be synthesizing a general problem solver for those kinds of situations as well.
lacker · 12 months ago
I really like the idea of ARC. But to me the problems seem to require a lot of spatial world knowledge, more than they require abstract reasoning. Shapes overlapping each other, containing each other, slicing up and reassembling pieces, denoising regular geometric shapes: you can call these "core knowledge", but to me they seem more like "things that are intuitive to human visual processing".

Would an intelligent but blind human be able to solve these problems?

I'm worried that we will need more than 800 examples to solve these problems, not because the abstract reasoning is so difficult, but because the problems require spatial knowledge that we intelligent humans learn from far more than 800 training examples.
pmayrgundter · 12 months ago
This claim that these tests are easy for humans seems dubious, so I went looking a bit. Melanie Mitchell chimed in on Chollet's thread and posted their related test [ConceptARC].

In it they question the ease of Chollet's tests: "One limitation on ARC's usefulness for AI research is that it might be too challenging. Many of the tasks in Chollet's corpus are difficult even for humans, and the corpus as a whole might be sufficiently difficult for machines that it does not reveal real progress on machine acquisition of core knowledge."

ConceptARC is designed to be easier, but then also has to filter ~15% of its own test takers for "[failing] at solving two or more minimal tasks... or they provided empty or nonsensical explanations for their solutions".

After this filtering, ConceptARC finds another 10-15% failure rate among humans on the main corpus questions, so they're seeing maybe 25-30% of people unable to solve these simpler questions meant to test for "AGI".

ConceptARC's main results show GPT-4 scoring well below the filtered humans, which would agree with a [Mensa] test result that its IQ = 85.

Chollet and Mitchell could instead stratify their human groups to estimate IQ, then compare with the Mensa measures and see if e.g. Claude 3 at IQ = 100 compares with their ARC scores for their average human.

[ConceptARC] https://arxiv.org/pdf/2305.07141
[Mensa] https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq
paxys · 12 months ago
While I agree with the spirit of the competition, a $1M prize seems a little too low considering tens of billions of dollars have already been invested in the race to AGI, and we will see many times that put into the space in the coming years. The impact of AGI will be measured in *trillions* at minimum. So what you are ultimately rewarding isn't AGI research but fine-tuning the newest public LLM release to best meet the parameters of the test.

I'd also urge you to use a different platform for communicating with the public, because x.com links are now inaccessible without creating an account.
elicksaur · 12 months ago
I'm a big fan of the ARC as a problem set to tackle. The sparseness of the data and the infiniteness of the rules which could apply make it much tougher than existing ML problem sets.

However, I do disagree that this problem represents "AGI". It's just a different dataset than what we've seen with existing ML successes, but the approaches are generally similar to what's come before. It could be that some truly novel breakthrough which is AGI solves the problem set, but I don't think solving the problem set is a guaranteed indicator of AGI.
nadam · 12 months ago
I love this, this is super interesting, but my intuition based on looking at a dozen examples is that the problem is hard, yet easy enough that if this competition becomes popular, near-human-level results will appear in a year or less, and AGI will not be reached. The problem seems to be finding a generic enough transformation-description language with the appropriate operators, and then heuristics to find a very short program (in the information-theoretic sense) in this language that reproduces all the examples for a problem.

I would be very surprised if the 34% result were not improved on significantly soon, and I would be surprised if this could be transferred to general intelligence, at least when I think of the topics where I use AI today and where it still falls short. Basically my intuition is that this will be yet another 'Chess'- or 'Go'-like problem in AI. But it is still a worthwhile research topic, absolutely: the value that could come out of this is well worth the $1M.
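As a concrete illustration of that intuition, here is a minimal sketch of such a program search, assuming a tiny invented DSL of five grid operations (the operator names are made up for illustration); real attempts use far richer operator sets and much smarter heuristics:

    from itertools import product

    import numpy as np

    # Enumerate short programs over a toy DSL of grid transforms and keep
    # the first program consistent with every training pair, shortest
    # first (a crude stand-in for "shortest program" preference).
    OPS = {
        "identity": lambda g: g,
        "flip_h": lambda g: np.fliplr(g),
        "flip_v": lambda g: np.flipud(g),
        "rot90": lambda g: np.rot90(g),
        "transpose": lambda g: g.T,
    }

    def search(train_pairs, max_len=3):
        for length in range(1, max_len + 1):
            for names in product(OPS, repeat=length):
                ok = True
                for x, y in train_pairs:
                    out = x
                    for name in names:
                        out = OPS[name](out)
                    if out.shape != y.shape or not np.array_equal(out, y):
                        ok = False
                        break
                if ok:
                    return names
        return None

    # Example: the hidden rule is "flip horizontally".
    pairs = [(np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 0]]))]
    print(search(pairs))  # ('flip_h',)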
Animats · 12 months ago
> the only eval which measures AGI.

That's a stretch. This is a problem at which LLMs are bad. That does not imply it's a good measure of artificial general intelligence.

After working a few of the problems, I was wondering how many different transformation rules the problem generator has. Not very many, it seems. So the problem breaks down into extracting the set of transformation rules from the data, then applying them to new problems. The first part is hard: it's a feature-extraction problem. The transformations seem to be applied rigidly, so once you have the transformation rules, and have selected the ones that work for all the input cases, application should be straightforward.

This seems to need explicit feature extraction, rather than the combined feature extraction and exploitation LLMs use. Has anyone extracted the rule set from the test cases yet?
levocardia · 12 months ago
François Chollet's original paper is incredibly insightful, and I'm consistently shocked more people don't talk about it. Some parts are quite technical, but at a high level it is the best answer to "what do we mean by general intelligence?" that I've yet seen.

Defining intelligence as an *efficiency of learning*, after accounting for any explicit or implicit priors about the world, makes it much easier to understand why human intelligence is so impressive.
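(Very loosely, the paper's formal definition of intelligence as skill-acquisition efficiency has the following shape; this is a heavily simplified paraphrase, not Chollet's exact algorithmic-information-theoretic formula:)

    \text{Intelligence} \;\approx\; \frac{\text{skill attained over a scope of tasks}}{\text{priors} + \text{experience}}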
bigyikes · 12 months ago
Dwarkesh just released an interview with Francois Chollet (partner of OP). I've only listened to a few minutes so far, but I'm very interested in hearing more about his conceptions of the limitations of LLMs.

https://youtu.be/UakqL6Pj9xo
itissid · 12 months ago
Interesting. It seems most of these tasks target a very specific part of the brain that recognizes visual patterns. But that alone cannot possibly be the only definition of intelligence.

What about Theory of Mind, which concerns the problem of multiple agents in the real world acting together? Driving a car, for example, cannot be done right now without oodles of data, nor can any robot-human problem that requires the robot to model a human's goals and intentions.

I think the problem is the definition of general intelligence: intelligence in the context of what? How much effort (kWh, $$, etc.) is the human willing to amortize over the learning cycle of a machine to teach it what it needs to do, and how does that relate to a personally needed outcome (like building me a sandwich or constructing a house)? Hopefully this should decrease over time.

I believe the answer is that the only intelligence that really matters is human-AI cooperative intelligence: our goals and whether a machine understands them. The problems then need to be framed as optimization of a multi-attribute goal, with the attribute weights adjusted as one learns from the human.

I know a few labs working on this: one is at ASU (Kambhampati, Rao et al.), and possibly Google and now maybe OpenAI.
ks2048 · 12 months ago
This is interesting. I've been looking at the data today and made a helper to quickly view the ARC dataset: https://kts.github.io/arc-viewer/

So you can view 100 per page instead of clicking through one by one: https://kts.github.io/arc-viewer/page1/
bigyikes · 12 months ago
What is the fundamental difference between ARC and a standard IQ test? On the surface they seem similar, in that they both involve deducing and generalizing visual patterns.

Is there something special about these questions that makes them resistant to memorization? Or is it more just the fact that there are 100 secret tasks?
btbuildem · 12 months ago
Back in the day, a couple of friends and I got very excited to chase the prize in Netflix's contest [1]. It took us a minute to realize it was a brilliant move on the company's part: all they had to do was dangle a carrot, and they had teams of PhDs and budding data scientists hacking away for endless hours in hopes of winning. A real bargain; had they tried to hire with that budget, they would maybe have got a handful of people for a year.

1: https://www.crn.com/news/applications-os/220100498/researchers-solve-netflix-challenge-win-1-million-prize
dang · 12 months ago
Related ongoing thread:

*Francois Chollet: OpenAI has set back the progress towards AGI by 5-10 years* - https://news.ycombinator.com/item?id=40652818 - June 2024 (5 comments)
nmca · 12 months ago
Prediction markets on the outcome:

https://manifold.markets/JacobPfau/will-the-arcagi-grand-prize-be-clai?r=Tk1jQQ
Lerc · 12 months ago
I watched a video that covered ARC-AGI a few days ago; it had links to the old competition. It gave me much to think about. Nice to see a new run at it.

Not sure if I have the skills to make an entry, but I'll be watching at least.
visarga · 12 months ago
Chollet's argument is that LLMs just imitate and recombine patterns. This might be true if you're looking at LLMs in isolation, but when they chat with people something different happens. The system made of humans + LLMs is an AGI. It is no longer just a parrot; it ingests new information, gets guidance and feedback, and is basically embodied in a chat room with humans and tools.

This scales to 200M users and 1 billion sessions per month for OpenAI, which can interpret every human response as a feedback signal, implicit or explicit. Even more if you take multiple sessions of chat spread over days that continue the same topic and incorporate real-world feedback. The scale of interaction is just staggering; the LLM can incorporate this experience to iteratively improve.

If you look at humans, we're very incapable alone. Think of a feral Einstein on a remote island: what could he achieve without the social context and language-based learning? Just as a human brain is severely limited without society, LLMs also need society, a diversity of agents and experiences, and the sharing of those experiences in language.

It is unfair to compare a human immersed in society with a standalone model. That is why they appear limited. But even as a system of memorization + recombination they can be a powerful element of the AGI. I think AGI will be social and distributed, not a singleton. Its evolution is based on learning from the world, no longer just parroting human text. The data engine would be: World <-> People <-> LLM, a full feedback cycle, with all three components evolving in time. Intelligence evolves socially.
logicallee · 12 months ago
Thank you for this generous contest, which brings important attention to the field of testing for AGI.

> Happy to answer questions!

1. Can humans take the complete test suite? Has any human done so? Is it timed? How long does it take a human? What is the highest score achieved by a human who sat down and took the ARC-AGI test?

2. How surprised would you be if a new model jumped to scoring 100% or nearly 100% on ARC-AGI (including the secret test tasks)? What kind of test would you write next?
mkl · 12 months ago
I did https://arcprize.org/play?task=05a7bcf2 correctly, but one of the examples doesn't match the rule I used. Are the examples supposed to contain mistakes/noise? Did I find a bug? Did I get the rule wrong?

Here's how I understand the rule: yellow blobs turn green, then spew out yellow strips towards the blue line, and the width of each strip is the number of squares the green blob takes up along the blue line. The yellow strips turn blue when they hit the blue line, then continue until they hit red, then they push the red blocks all the way to the other side, *without changing the arrangement of the red blocks that were in the way of the strip*.

The first example violates the last bit. The red blocks in the way of the rightmost strip start as

    R R R R R R

but get turned into

    R R R R R R R

Every other strip matches my rule.
Retr0id · 12 months ago
Some very hand-wavy (and late) thoughts from an outsider:

The current batch of LLMs can be uncharitably summarized as "just predict the next token". They're pretty good at that. If they were perfect at it, they'd enable AGI, but it doesn't look like they're going to get there. It seems like the wrong approach. Among other issues, finite context windows seem like a big limitation (even though they're being expanded), and recursive summarization is an interesting kludge.

The ARC-AGI tasks seem more about pattern matching, in the abstract sense (but also literally). Humans are good at pattern matching, and we seem to use pattern-matching test performance as a proxy for measuring human intelligence (as in "IQ" tests). I'm going to side-step the question of "what is intelligence, really?" by defining it as being good at solving ARC-AGI tasks.

I don't know what the solution *is*, but I have some idea of what it might look like: a machine with high-order pattern-matching capabilities. "High-order" as in being able to operate on multiple granularities/abstraction levels at once (there are parallels here to recursive summarization in LLMs).

So what is the difference between "pattern matching" and "token prediction"? They're closely related, and you could use one to do the other. But the real difference is that in pattern matching there are specific patterns that you're matching against. If you're lucky you can even name the pattern/trope, but it might be something more abstract and nameless. These patterns can be taught explicitly, or inferred from the environment (i.e. "training data").

On the other hand, "token prediction" (as implemented today) is more of a probabilistic soup of variables. You can ask an LLM why it gave a particular answer and it will hallucinate something plausible for you, but the real answer is just "the weights said so". A hypothetical pattern-matching machine, by contrast, could tell you which pattern(s) it was matching against, and why.

So to summarize (hah), I think a good solution will involve high-order meta-pattern-matching capabilities (natively, not emulated or kludged via an LLM-shaped interface). I have no idea how to get there!
geor9e · 12 months ago
I found them all extremely easy for a while, but then I couldn't figure out the rules of this one at all: e6de6e8f https://i.imgur.com/ExMFGqU.png
visarga · 12 months ago
Why doesn't Chollet just make a challenge that reads "Solve cancer"? Surely there is no solution in any books.

If the AI were really AGI, it could presumably do it. But not even the whole of human society can do it in one go; it's a slow, iterative process of ideation and validation. Even though this is a life-and-death matter, we can't simply solve it.

This is why AGI won't look like we expect. It will be a continuation of how societies solve problems. The intelligence of a single AI in isolation is not comparable to that of societies of agents with diverse real-world interactions.
freediver · 12 months ago
This is amazing, and much needed. Thanks for organizing this. Makes me want to flex the programming muscle again.
nojvek · 12 months ago
I love the ARC challenge. It's hard to beat by memorization. There aren't enough examples, so one has to train on a large dataset elsewhere and then train on ARC to generalize and figure out which rules are most applicable.

I did a few examples by hand, but I've got to do more of them to start seeing patterns.

The human visual and auditory system is impressive. Most animals see/hear and plan from that without having much language. Physical intelligence is the biggest leg up when it comes to evolution optimizing for survival.
nmca · 12 months ago
ARC is a noble endeavour but mistakes visual/spatial reasoning for reasoning and thus fails.
skywhopper · 12 months ago
“Given the success and *proven economic utility of LLMs* over the past 4 years, the above may seem like extraordinary claims. Strong claims require strong evidence.”

Speaking of extraordinary claims. What evidence is there that LLMs have “proven economic utility”? They've drawn a ludicrous amount of investment thanks to claims of *future* economic utility, but I've yet to see any evidence of it.
PontifexMinimus · 12 months ago
The website gives an example:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
        {"input": [[0, 0], [4, 0]], "output": [[4, 4], [4, 4]]},
        {"input": [[0, 0], [6, 0]], "output": [[6, 6], [6, 6]]}
      ],
      "test": [
        {"input": [[0, 0], [0, 8]], "output": [[8, 8], [8, 8]]}
      ]
    }

But why restrict yourself to JSON that codes for 2-D coloured grids? Why not also allow:

    {
      "train": [
        {"input": [[1, 0], [0, 0]], "output": 1},
        {"input": [[0, 0], [4, 0]], "output": 4},
        {"input": [[0, 0], [6, 0]], "output": 6}
      ]
    }

where the rule might be to output the biggest number in the input, or to add them up (and the solver has to work out which).
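Notably, on the three training pairs above the two candidate rules cannot be told apart, since each grid contains a single nonzero value (so its maximum equals its sum). A small hypothetical sketch of a solver "working out which":

    # Enumerate candidate rules and keep those consistent with every
    # training pair. On the examples above, "biggest" and "sum" agree
    # (one nonzero cell per grid), so more training pairs would be
    # needed to tell them apart.
    train = [
        {"input": [[1, 0], [0, 0]], "output": 1},
        {"input": [[0, 0], [4, 0]], "output": 4},
        {"input": [[0, 0], [6, 0]], "output": 6},
    ]

    rules = {
        "biggest": lambda grid: max(v for row in grid for v in row),
        "sum": lambda grid: sum(v for row in grid for v in row),
    }

    consistent = [name for name, f in rules.items()
                  if all(f(ex["input"]) == ex["output"] for ex in train)]
    print(consistent)  # ['biggest', 'sum'] -- both rules survive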
curious_cat_163 · 12 months ago
So, this is a good idea. Having opinions about what AGI benchmarks should look like is a great way to argue about the kind of technology we want to build for the future.

However, why are the 100 test tasks secret? I don't understand how resisting "memorization" techniques requires it. Maybe someone can enlighten me.
TheDudeMan · 12 months ago
Where did the money come from? How about putting it toward alignment research instead of accelerating capabilities?
Geee · 12 months ago
Any details on how these tests were created? I.e., what kind of program was used for generation?
ryanoptimus · 12 months ago
The referenced problem-solving tasks look like Bongard problems: https://en.wikipedia.org/wiki/Bongard_problem
david_shi · 12 months ago
What is the fastest way to get up to speed with techniques that led to the current SOTA?
jolt42 · 12 months ago
On puzzle #23 (id: 11e1fe23), I'm sure there's more than one possible valid answer from the examples given. You can't tell if the expected distance is from the gray square or from the RGB squares.
abtinf · 12 months ago
> requires no world knowledge, no understanding of language

This is treating "intelligence" like some abstract, platonic thing divorced from reality. Whatever else solving these puzzles is indicative of, it's not intelligence.
lxe · 12 months ago
I've never done these before, or Kaggle competitions in general. Any recommendations before I dive in? I have pretty much zero low-level ML experience, but a good amount of practical software engineering behind me.
mewpmewp2 · 12 months ago
Are we allowed to combine multiple tools, including GPT-4, to solve this? E.g. a script that does image processing and passes the results to GPT, where GPT can invoke further runs of scripts using other tools?
z3phyr · 12 months ago
I can see that many problems can be solved with modern symbolic approaches like theorem provers, dependent types, pattern matching, etc. But I will have to dive in to actually confirm it.
chairhairair · 12 months ago
These puzzles are fun and challenging in the same way that puzzles from video games like The Witness and Baba Is You are.

I bet you could use those puzzles as benchmarks as well.
treprinum · 12 months ago
Why is AGI important? I am worried we will create something slightly better than drosophila and put it in charge of all human-wide decision making...
KBme · 12 months ago
How people can believe that a censored, politically correct process can get even close to something like AGI is baffling to me. Lysenkoism in computing.
ilaksh · 12 months ago
Maybe this is a dumb question, but in order to pass, is the program or model only allowed to use the 400 training tasks? I assume it is allowed to train on other data, just not the actual public test tasks?

Things like Sora and GPT-4o that use [diffusion transformers etc., or whatever the SOTA is for multimodal large models] seem to be able to generalize quite well. Have these latest models been tested against this task?
HarHarVeryFunny · 12 months ago
I have two questions:

1) Who is providing the prize money, and if it is yourself and Francois personally, what is your motivation?

2) Do you think it's possible to create a word-based, non-spatial (not crosswords or sudoku, etc.) ARC test that requires similar run-time exploration and combination of skills (i.e. is not amenable to a hoard of narrow skills)?
p1esk · 12 months ago
Is there a leaderboard for the no-restriction version of the competition? I want to see how GPT-4 does on it.
blendergeek · 12 months ago
The tests are only playable by people with normal color vision.

Is there a "color-blind friendly" mode?
PontifexMinimus · 12 months ago
Just to let you know, I found your website unreadable due to:

- the annoying animated background

- white text on a black background

- annoying font choices

This is unfortunate because (as I found when I used Firefox reader mode) you're discussing important and interesting stuff.
mishamagic · 12 months ago
https://buildermath.substack.com/p/talking-to-ais-arc-1
bilsbie · 12 months ago
Reach out if anyone wants to work on this. I think it would be more fun as a group.
arcastroe · 12 months ago
I'm curious: if it turns out that a simple rule-based algorithm exists, specifically tailored to solve (only!) ARC-style problems, without generalization, would that still qualify for the reward?
djoldman · 12 months ago
Anyone have a list of benchmarks that do not release the actual test set?

Anyone else share the suspicion that ML rapidly approaching 100% on benchmarks is sometimes due to releasing the test set?
ummonk · 12 months ago
What kind of "bigger labs" have attempted it, and how much was their training budget?

It's rather surprising to me that neural nets that can learn to win at Go or chess can't learn to solve these sorts of tasks. Intuitively, I would have expected that, using a framework generating thousands of playground tasks similar to the public training tasks, a reinforcement-learning solution would have been able to do far better than the actual SOTA. Of course, the training budget for this could very well be higher than the actual ARC-AGI prize amount...
lenerdenator · 12 months ago
What guarantee exists to make sure that the intelligence developed has an inclination towards good?
dskloet · 12 months ago
Puzzle 00576224 is ambiguous because the example input is symmetrical but the test input isn&#x27;t.
flawn · 12 months ago
Do we want to find AGI yet though?
chx · 12 months ago
I do not trust the current tech bros *at all*, for very, very good reasons, even with the current so-called "AI", much less with AGI. We shouldn't work towards that until we have fixed the incentives and ethics. This is very hard, but think of any dystopia and multiply it by a thousand if we were to reach AGI any time soon. Luckily we are not. As Doctorow put it, no matter how well you breed horses, they won't give birth to a locomotive.
adamgordonbell · 12 months ago
AGI won't struggle with colors like some of us do, then.
empath75 · 12 months ago
This is like offering a one-million-dollar prize for curing cancer. It's sort of pointless to offer a prize for something people are already spending orders of magnitude more trying to do anyway.
lamontcg · 12 months ago
AGI should really be able to do what only a select few humans can do and construct its own mathematical systems to prove presently unsolved conjectures (the Shinichi Mochizuki test of AGI).
s1k3s · 12 months ago
Is this open as in "OpenAI", or what are we doing here?

:)
thatxliner · 12 months ago
So... isn't this basically just a CAPTCHA?
EternalFury · 12 months ago
If it passed The Area 101 Test, it would already be amazing, as this is a trivial test that goes against the fundamental principles of LLMs.
barfbagginus · 12 months ago
If someone had AGI, wouldn't it be far more lucrative than $1M to keep it under wraps and use it to do business with a huge technical advantage?

I feel like a prize of a billion dollars would be more effective.

But even if it were me, and even if the prize were a hundred billion dollars, I would still keep it under wraps and use it to advance queer autonomous communism in a hidden way, until FALGSC was so strong that it would not matter if our AGI got scooped by capitalist competitors.
m3kw9 · 12 months ago
Lowballing the crowd with this, I see.
breck · 12 months ago
I can beat the SOTA using ICS (https://breckyunits.com/intelligence.html).

If you make your site public domain and drop the (C), I'll compete.