TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

OpenAI o1 Results on ARC-AGI-Pub

187 points by z7 · 8 months ago

18 comments

killthebuddha · 8 months ago
In my opinion this blog post is a little misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think), I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fabca

Anyway, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.
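For context, prompting a text-only model on an ARC task usually means serializing the example grids as digits. A minimal sketch of that serialization (the prompt wording and the toy grids here are my own, not from any official harness):

```python
import json

def format_arc_prompt(train_pairs, test_input):
    """Render an ARC-style task as a plain-text prompt for a chat model.

    Grids are lists of lists of ints (0-9), one int per cell color.
    """
    lines = [
        "Each example transforms an input grid into an output grid.",
        "Infer the rule and apply it to the test input.",
        "",
    ]
    for i, (inp, out) in enumerate(train_pairs, 1):
        lines.append(f"Example {i} input: {json.dumps(inp)}")
        lines.append(f"Example {i} output: {json.dumps(out)}")
    lines.append(f"Test input: {json.dumps(test_input)}")
    lines.append("Test output:")
    return "\n".join(lines)

# A toy task: the (unstated) rule doubles every cell value.
train = [([[1, 2], [3, 4]], [[2, 4], [6, 8]])]
prompt = format_arc_prompt(train, [[5, 0], [1, 1]])
```

The resulting string is what gets pasted into (or sent via API to) the model; "fancy prompting" on top of this typically means adding worked reasoning or feedback turns.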
Stevvo · 8 months ago
"Greenblatt", shown with 42% in the bar chart, is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248

So, how well might o1 do with Greenblatt's strategy?
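For readers unfamiliar with it: Greenblatt's strategy samples many candidate Python programs from the model and keeps only those that reproduce every training pair. A toy sketch of the filtering step (the candidate functions here are hypothetical stand-ins for LLM-sampled code, not his actual pipeline):

```python
# Hypothetical candidate "programs" standing in for LLM-sampled code.
def flip_rows(g):
    return g[::-1]

def transpose(g):
    return [list(r) for r in zip(*g)]

def double(g):
    return [[2 * v for v in row] for row in g]

def select_programs(candidates, train_pairs):
    """Keep only candidates that reproduce every training output."""
    return [f for f in candidates
            if all(f(inp) == out for inp, out in train_pairs)]

train = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]  # the hidden rule: flip rows
survivors = select_programs([flip_rows, transpose, double], train)
```

The real version samples thousands of programs per task, which is why the approach trades compute for accuracy much like o1's long chains of thought do.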
w4 · 8 months ago
> "o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet."

Sheesh. We're going to need more compute.
fsndz · 8 months ago
As expected. I've always believed that with the right data, allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement-learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
GaggiX · 8 months ago
It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

That being said, the ARC-AGI test is mostly a visual test that would be much easier to beat once these models are truly multimodal (not just appending a separate vision encoder after training), in my opinion.

I wonder what the graph will look like a year from now; the models have improved a lot in the last one.
alphabetting · 8 months ago
This is the best AGI benchmark out there, in my opinion. Surprising results that underscore how good Sonnet is.
mrcwinn · 8 months ago
How is Anthropic accomplishing this despite (seemingly) arriving later? What advantage do they have?
fancyfredbot · 8 months ago
The best part of this article, I found, was the level-headed explanation of why log-linear improvements in test score with increased compute aren't revolutionary. That's not to say the rest wasn't good too! One of the best articles on o1 I've read.
benreesman · 8 months ago
The test you really want is the apples-to-apples comparison between GPT-4o faced with the same CoT and other context annealing that, presumably, Q* (sorry, Strawberry) now feeds it (on your dime). This would of course require seeing the tokens you are paying for, instead of being threatened with bans for asking about them.

Compared to the difficulty of assembling the data, compute, and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to a human click proxy" is just not at that same scale.
Terretta · 8 months ago
TL;DR (direct quote):

> "In summary, o1 represents a paradigm shift from 'memorize the answers' to 'memorize the reasoning' but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution."

> "We still need new ideas for AGI."
ec109685 · 8 months ago
Why is this considered such a great AGI test? It seems possible to extensively train a model on the algorithms used to solve these cases, and some cases feel beyond what a human could straightforwardly figure out.
a_wild_dandan · 8 months ago
This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.
lossolo · 8 months ago
It seems like o1 is a lot worse than Claude on coding tasks: https://livebench.ai
perching_aix · 8 months ago
Is it possible for me, a human, to undertake these benchmarks?
Alifatisk · 8 months ago
This is great marketing for Anthropic.
meowface · 8 months ago
Takeaway:

> o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Scores:

> GPT-4o: 9%
> o1-preview: 21%
> Claude 3.5 Sonnet: 21%
> MindsAI: 46% (current highest score)
bulbosaur123 · 8 months ago
OK, I have a practical question: how do I use this o1 thing to view the codebase for my game app and then simply add new features based on my prompts? Is it possible right now? How?
devit · 8 months ago
Am I missing something, or is this "ARC-AGI" thing so ludicrously terrible that it seems to be completely irrelevant?

It seems that the tasks consist of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.

The problem, of course, is that the transformation is not specified, so any answer is actually acceptable, since one can always come up with a justification for it; thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).

It's like those stupid tests that tell you "1 2 3 ..." and expect you to complete with 4. But that's absurd, since any continuation is valid: e.g. you can find a polynomial that passes through any four numbers, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.

Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).

And if, instead of AGI, one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate *language* model performance by performance on predicting colored grid transformations.

For instance, since normal LLMs are not trained on colored grids, any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" as the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite not really being a better model in general.
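The polynomial point above is easy to make concrete: given any finite prefix, exact interpolation realizes any continuation you like. A quick sketch using exact rational arithmetic (the chosen points and the forced value 42 are arbitrary illustrations):

```python
from fractions import Fraction

def lagrange_value(points, x):
    """Evaluate the unique degree <= n-1 polynomial through n `points`
    at x, via Lagrange interpolation with exact rational arithmetic."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                # Basis factor is 1 at xi and 0 at every other xj.
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# The sequence 1, 2, 3 can "legitimately" continue with anything:
pts = [(1, 1), (2, 2), (3, 3), (4, 42)]  # force the fourth term to be 42
assert all(lagrange_value(pts, x) == y for x, y in pts)
```

So "1 2 3 42" is generated by a perfectly well-defined cubic; without an explicit simplicity criterion (e.g. minimum description length), "4" is a convention, not a mathematical necessity.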