OpenAI O3 breakthrough high score on ARC-AGI-PUB

1724 points | by maurycy | 5 months ago

162 comments

bluecoconut 5 months ago

Efficiency is now key.

~$3400 per single task to meet human performance on this benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (e.g. via the API they showed off last week), so even more compute went into this task.

We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr, and their document puts an average Mechanical Turker at about $2 per task.)

Going the other direction: I am interpreting this result as human-level reasoning now costing (approximately) $41k/hr to $2.5M/hr with current compute.

Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC; now we get to work towards making this economical!
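(A minimal sketch of the back-of-envelope arithmetic above, using the commenter's own assumed figures, not any official pricing:)

    # Back-of-envelope check of the parent comment's numbers (all figures are
    # the commenter's assumptions, not official pricing).
    HUMAN_RATE = 20.0            # USD per hour
    MODEL_COST_PER_TASK = 3400   # USD, rough estimate for o3 high-compute

    for seconds in (5, 300):     # fast and slow human solve times
        human_cost = HUMAN_RATE * seconds / 3600             # USD per puzzle
        model_hourly = MODEL_COST_PER_TASK * 3600 / seconds  # USD per human-hour equivalent
        print(f"{seconds}s: human ~${human_cost:.2f}/puzzle, o3 ~${model_hourly:,.0f}/hr")
    # 5s:   human ~$0.03/puzzle, o3 ~$2,448,000/hr
    # 300s: human ~$1.67/puzzle, o3 ~$40,800/hr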
croemer 5 months ago

The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in the terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forward.

YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)

Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aaa102e6
modeless 5 months ago

Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.

A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.

We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!
obblekk 5 months ago

Human performance is 85% [1]. o3 high gets 87.5%.

This means we have an algorithm to get to human-level performance on this task.

If you think this task is an eval of general reasoning ability, we have an algorithm for that now.

There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.

Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1
nopinsight 5 months ago

Let me go against some skeptics and explain why I think full o3 is pretty much AGI, or at least embodies most essential aspects of AGI.

What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)

ARC has been challenging precisely because solving its problems often requires:

    1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND
    2) using the right level(s) of abstraction

Achieving human-level performance on the ARC benchmark, *as well as* top human performance in GPQA, Codeforces, AIME, and Frontier Math, suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.

It might not *yet* be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.

[1] https://www.harvardlds.org/wp-content/uploads/2017/01/SpelkeKinzler07-1.pdf

ADDED:

Thanks for the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still-image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will *not* present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)
sn0wr8ven 5 months ago

Incredibly impressive. Still can't really shake the feeling that this is o3 gaming the system more than it actually being able to reason. If the reasoning capabilities are there, there should be no reason why it achieves 90% on one version and 30% on the next. If a human maintains the same performance across the two versions, an AI that can reason should too.
w4 5 months ago

The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task. [1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.

I think this is a mistake.

Even if very high costs make o3 uneconomic for businesses, it could be an epoch-defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.

Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?

There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.

So *if* it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.

[1] https://news.ycombinator.com/item?id=42473876
phil917 5 months ago

Direct quote from the ARC-AGI blog:

"SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate to achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible."

The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.

Also, one odd thing I noticed is that the graph in their blog post shows the top 2 scores as "tuned" (this was not displayed in the live demo graph). This suggests in those cases that the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…
ripped_britches 5 months ago

Sad to see everyone so focused on compute expense during this massive breakthrough. GPT-2 originally cost $50k to train, but now can be trained for ~$150.

The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down, as is evidenced by precedents, Moore's law, o3-mini being cheaper than o1 with improved performance, etc.
Imnimo 5 months ago

Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.

Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.
hamburga 5 months ago
I’m not sure if people realize what a weird test this is. They’re these simple visual puzzles that people can usually solve at a glance, but for the LLMs, they’re converted into a json format, and then the LLMs have to reconstruct the 2D visual scene from the json and pick up the patterns.<p>If humans were given the json as input rather than the images, they’d have a hard time, too.
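(For readers who haven't seen the raw format, here is a rough sketch of what an ARC task looks like once serialized. The grid values below are made up, but the JSON shape matches the public ARC-AGI task files, where each cell is an integer 0-9 standing for a color.)

    # Roughly what an ARC task looks like as data (toy grids, real JSON shape):
    import json

    task = {
        "train": [
            {"input":  [[0, 0, 1],
                        [0, 1, 0],
                        [1, 0, 0]],
             "output": [[1, 0, 0],
                        [0, 1, 0],
                        [0, 0, 1]]},
        ],
        "test": [
            {"input": [[0, 2, 0],
                       [2, 0, 0],
                       [0, 0, 2]]},   # the model must produce the matching output grid
        ],
    }

    # The "picture" a text-only model actually sees is just this string:
    print(json.dumps(task["train"][0]["input"]))   # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]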
aithrowawaycomm 5 months ago

I would like to see this repeated with my highly innovative HARC-HAGI, which is ARC-AGI but it uses hexagons instead of squares. I suspect humans would only make slightly more brain farts on HARC-HAGI than ARC-AGI, but O3 would fail very badly since it almost certainly has been specifically trained on squares.

I am not really trying to downplay O3. But this would be a simple test of whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.
highfrequency 5 months ago

Very cool. I recommend scrolling down to look at the example problem that O3 still can’t solve. It’s clear what goes on in the human brain to solve this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn’t quite work, so we zoom into an example that we got wrong and refine the hypothesis so that it solves that sample. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, how humans do science - iteratively formulating, rejecting and refining hypotheses against collected data.

From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:

1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)

2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to *revise* the hypothesis in the simplest possible way that also explains this example.

3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.

4. Induce randomness in the process (through next-word sampling noise, example ordering, etc.) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.

5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
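(A minimal sketch of the loop guessed at above. This is purely the commenter's speculation about how o3 might work, not anything OpenAI has described; `propose`, `check`, and `revise` are placeholders for LLM calls.)

    import random

    def search_hypothesis(examples, propose, check, revise, n_runs=1000):
        """Illustrates the speculated control flow only; propose/check/revise
        stand in for LLM calls, not real APIs."""
        candidates = []
        for _ in range(n_runs):                     # step 4: many randomized runs
            order = random.sample(examples, len(examples))
            hypothesis = propose(order[0])          # step 1: rule from one example
            stable = False
            while not stable:                       # step 3: keep iterating
                stable = True
                for ex in order:                    # step 2: check each example
                    if not check(hypothesis, ex):
                        hypothesis = revise(hypothesis, ex)
                        stable = False              # a revision may break earlier examples
            candidates.append(hypothesis)
        # step 5: among hypotheses that fit all examples, prefer the "simplest"
        # (string length is only a crude stand-in for simplicity here)
        return min(candidates, key=lambda h: len(str(h)))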
zebomon 5 months ago

My initial impression: it's very impressive and very exciting.

My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.

I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence to seeing it as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.

As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.

I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.
miga89 5 months ago

How do the organisers keep the private test set private? Does OpenAI hand them the model for testing?

If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?

(I am sure I am missing something.)
Balgair 5 months ago

Complete aside here: I used to do work with amputees and prosthetics. There is a standardized test (and I just cannot remember the name) that fits in a briefcase. It's used for measuring the level of damage to the upper limbs and for prosthetic grading.

Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zipping up a jacket, locking your own door, putting on lipstick, etc.

We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).

Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.
tymonPartyLate 5 months ago

Isn't this like a brute force approach? Given it costs $3000 per task, that's like 600 GPU hours (H100 at Azure). In that amount of time the model can generate millions of chains of thought and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks, and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)
neuroelectron 5 months ago

OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model.

Semi-private eval (100 tasks): 75.7% @ $2,012 total for 100 tasks (~$20/task), with just 6 samples & 33M tokens processed in ~1.3 min/task.

The "low-efficiency" setting with 1024 samples scored 87.5% but required 172x more compute.

If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.

On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting (high-efficiency mode: $6,677).

OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
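(Reproducing that estimate in a few lines. All inputs are the commenter's figures, and the linear cost-scaling assumption is theirs too; OpenAI withheld the actual high-compute cost.)

    # Commenter's figures for the 6-sample ("high-efficiency") runs, in USD:
    high_eff_cost = {"semi_private": 2012, "public": 6677}
    compute_multiplier = 172   # 1024-sample mode reportedly used ~172x more compute

    # Assume cost scales linearly with compute (an assumption, not a fact):
    low_eff_cost = {k: v * compute_multiplier for k, v in high_eff_cost.items()}
    total = sum(high_eff_cost.values()) + sum(low_eff_cost.values())

    print(low_eff_cost)   # {'semi_private': 346064, 'public': 1148444}
    print(total)          # 1503197, close to the ~$1.5M quoted above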
yawnxyz 5 months ago

The o3 High (tuned) model scored 88% at what looks like $6,000/task, haha.

I think soon we'll be pricing any kind of task by its compute cost. So basically: human = $50/task, AI = $6,000/task, use the human. If AI beats the human, use AI? Of course, that's assuming both get 100% scores on the task.
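(A toy version of that routing rule; the dollar figures and the equal-quality assumption are the commenter's, not real pricing data.)

    def route_task(human_cost, ai_cost, human_score=1.0, ai_score=1.0):
        """Pick the cheaper worker when both fully solve the task."""
        if ai_score >= human_score and ai_cost < human_cost:
            return "ai"
        return "human"

    print(route_task(human_cost=50, ai_cost=6000))   # -> "human"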
msoad 5 months ago

There is new research where the chain of thought happens in latent space and not in English. It demonstrated better results, since language is not as expressive as the concepts that can be represented in the layers before the decoder. I wonder if o3 is doing that?
spaceman_2020 5 months ago

Just as an aside, I've personally found o1 to be completely useless for coding.

Sonnet 3.5 remains the king of the hill by quite some margin.
botro 5 months ago

The LLM community has come up with tests they call 'Misguided Attention' [1] where they prompt the LLM with a slightly altered version of common riddles / tests etc. This often causes the LLM to fail.

For example, I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.

[1] https://github.com/cpldcpu/MisguidedAttention
vicentwu 5 months ago

"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."

Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100 pairs, I would say it is amazing!
nxobject 5 months ago

As an aside, I'm a little miffed that the benchmark calls out "AGI" in the name, but then heavily cautions that it's necessary but insufficient for AGI.

> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI
Engineering-MD 5 months ago

Can I just say what a dick move it was to do this as part of the 12 days of Christmas. To be honest, I agree with the arguments that this isn't as impressive as my initial impression, but they clearly intended it to be shocking / a show of possible AGI, which is rightly scary.

It feels so insensitive to do that right before a major holiday when the likely outcome is a lot of people feeling less secure in their career/job/life.

Thanks again, OpenAI, for showing us you don't give a shit about actual people.
onemetwo 5 months ago

In (1) the author uses a technique to improve the performance of an LLM: he trained Sonnet 3.5 to obtain 53.6% on the arc-agi-pub benchmark, and moreover he said that more compute would give better results. So the results of o3 could be produced in this way, using the same method with more compute; if this is the case, the result of o3 is not very interesting.

(1) https://params.com/@jeremy-berman/arc-agi
YeGoblynQueenne 5 months ago

I just noticed this bit:

>> Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis.

"Program synthesis" is here used in an entirely idiosyncratic manner, to mean "combining programs". Everyone else in CS and AI for the last many decades has used "Program Synthesis" to mean "generating a program that satisfies a specification".

Note that "synthesis" can legitimately be used to mean "combining". In Greek it translates literally to "putting [things] together": "syn" (plus) "thesis" (place). But while generating programs by combining parts of other programs is an old-fashioned way to do Program Synthesis, in the standard sense, the end result is always desired to be a program. The LLMs used in the article to do what F. Chollet calls "Program Synthesis" generate no code.
ndm000 5 months ago

One thing I have not seen commented on is that ARC-AGI is a visual benchmark but LLMs are primarily text. For instance, when I see one of the ARC-AGI puzzles, I have a visual representation in my brain and apply some sort of visual reasoning to solve it. I can "see" in my mind's eye the solution to the puzzle. If I didn't have that capability, I don't think I could reason through words how to go about solving it - it would certainly be much more difficult.

I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each task was thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually with much better performance?
oezi 5 months ago

> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time

I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.

The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make Turing complete if we just have an eigenvector representation figured out.
flakiness 5 months ago

The cost axis is interesting. o3 Low is $10+ per task and o3 High is over $1000 (it's a logarithmic graph, so it's like $50 and $5000 respectively?)
t0lo 5 months ago

I'm 22 and have no clue what I'm meant to do in a world where this is a thing. I'm moving to a semi-rural, outdoorsy area where they teach data science and marine science and I can enjoy my days hiking, and the march of technology is a little slower. I know this will disrupt so much of our way of life, so I'm chasing what fun, innocent years are left before things change dramatically.
mortehu 5 months ago

The chart is super misleading, since the test was obscure until recently. A few months ago he announced he'd made the only good AGI test and offered a cash prize for solving it, only to find out in as much time that it's no different from other benchmarks.
attentionmech 5 months ago

Isn't this at the level now where it can sort of self-improve? My guess is that they will just use it to improve the model, and the cost they are showing per evaluation will go down drastically.

So, is the next step in reasoning open-world reasoning now?
mukunda_johnson 5 months ago

Deciphering patterns in natural language is more complex than these puzzles. If you train your AI to solve these puzzles, we end up in the same spot. The difficulty of solving would be with creating training data for a foreign medium. The "tokens" are the grids and squares instead of words (for words, we have the internet of words, solving that).

If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!
figure8 5 months ago

I have a very naive question.

Why is the ARC challenge difficult but coding problems are easy? The two examples they give for ARC (border width and square filling) are much simpler than the patterns I see simple models find in code every day.

What am I misunderstanding? Is it that one is a visual grid context which is unfamiliar?
RivieraKid 5 months ago
It sucks that I would love to be excited about this... but I mostly feel anxiety and sadness.
Bjorkbat 5 months ago
I was impressed until I read the caveat about the high-compute version using 172x more compute.<p>Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.<p>The results are cool, but man, this sounds like such a busted approach.
SerCe 5 months ago

> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.
Seattle3503 5 months ago

How can there be "private" tasks when you have to use the OpenAI API to run queries? OpenAI sees everything.
edithpixie 5 months ago
For many people and businesses, navigating the frequently dangerous landscape of financial loss can be an intimidating and overwhelming process. Nevertheless, the knowledgeable staff at Wizard Hilton Cyber Tech provides a ray of hope and direction with their indispensable range of services. Their offerings are based on a profound grasp of the far-reaching and terrible effects that financial setbacks, whether they be the result of cyberattacks, data breaches, or other unforeseen tragedies, can have. Their highly-trained analysts work tirelessly to assess the scope of the damage, identifying the root causes and developing tailored strategies to mitigate the fallout. From recovering lost or corrupted data to restoring compromised systems and securing networks, Wizard Hilton Cyber Tech employs the latest cutting-edge technologies and industry best practices to help clients regain their financial footing. But their support goes beyond the technical realm, as their compassionate case managers provide a empathetic ear and practical advice to navigate the emotional and logistical challenges that often accompany financial upheaval. With a steadfast commitment to client success, Wizard Hilton Cyber Tech is a trusted partner in weathering the storm of financial loss, offering the essential services and peace of mind needed to emerge stronger and more resilient than before.
energy123 5 months ago

At about 12-14 minutes into OpenAI's YouTube vid they show that o3-mini beats o1 on Codeforces despite using much less compute.
thisisthenewme 5 months ago

I feel like AI is already changing how we work and live - I've been using it myself for a lot of my development work. Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better than (or close to) what humans can. We're talking about a huge shift where first knowledge workers get automated, then physical work too. The thing is, our whole society is built around people working to earn money, so what happens when AI can do most jobs? It's not just about losing jobs - it's about how people will pay for basic stuff like food and housing, and what they'll do with their lives when work isn't really a thing anymore. Or do people feel like there will be jobs safe from AI? (Hopefully also fulfilling.)

Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.
spyckie2 5 months ago

The more Hacker News-worthy discussion is the part where the author talks about search through the possible mini-program space of LLMs.

It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.
smy20011 5 months ago

It seems o3 follows the trend of chess engines, where you can cut your search depth depending on state.

It's good for games with a clear signal of success (win/lose for chess, tests for programming). One of the blockers for AGI is that we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.
roboboffin 5 months ago

Interesting that in the video, there is an admission that they have been targeting this benchmark. A comment that was quickly shut down by Sam.

A bit puzzling to me. Why does it matter?
hackpert 5 months ago

If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app
whoistraitor 5 months ago

The general message here seems to be that inference-time brute-forcing works as long as you have a good search and evaluation strategy. We've seemingly hit a ceiling on the base LLM forward-pass capability, so any further wins are going to be in how we juggle multiple inferences to solve the problem space. It feels like a scripting problem now. Which is cool! A fun space for hacker-engineers. Also:

> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.

I found this such an intriguing way of thinking about it.
asdf6969 5 months ago
Terrifying. This news makes me happy I save all my money. My only hope for the future is that I can retire early before I’m unemployable
sourcepluck 5 months ago

Am I understanding correctly that the only thing with a bit of actual data released so far is the ARC-AGI piece from Francois Chollet? And every other claim has no further data released on it?

Serious question. I've browsed around, looked for the official release, but it seems to be just hear-say for now, except for the few little bits in the ARC-AGI article.

So some of the reactions seem quite far-fetched. I was quite amazed at first seeing the benchmarks, but then actually read the ARC-AGI article and a few other things about how it worked, learned a bit more about the different benchmarks, and realised we've no proper idea yet how o3 is working under the hood; the thing isn't even released.

It could be doing the same thing that chess engines do, except in several specific domains. Which would be very cool, but not necessarily "intelligent" or "generally intelligent" in any sense whatsoever! Whether that kind of model will lead to finding novel mathematical proofs, or actually "reasoning" or "thinking" in any way similar to a human, remains entirely uncertain.
skizm 5 months ago

This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements, I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?

(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect.)
joshdavham 5 months ago
A lot of the comments seem very dismissive and a little overly-skeptical in my opinion. Why is this?
whimsicalism 5 months ago

We need to start making benchmarks for memory & continued processing of a task over multiple days, handoffs, etc. (i.e. 'agentic' behavior). Not sure how possible this is.
mensetmanusman 5 months ago
I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.
blixt 5 months ago

These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.

But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.

Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit, we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask it to do something, this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.

(There are already things like tool use, linear attention, RAG, etc. that can help here, but currently they come with downsides and I would consider them insufficient.)
mattfrommars 5 months ago

Guys, it's already happening. I recently got laid off due to AI taking over my job.
didibus 5 months ago

I'm skeptical of these benchmarks. I mean, look at the problem it's solving? I'm sorry, this is our benchmark of AGI? It will never fly with the common person when someone claims AGI and all it did was fill a grid of pixels.

Was it zero-shot at least, and Pass@1? I guess it was not zero-shot, since it shows examples of other similar problems and their solutions. It also sounds like it was fine-tuned on that specific task.

Look, maybe this shows that it could soon be used to replace some MTurk-style workers, but I don't know that counts as AGI. To me, AGI needs to be able to solve novel problems, to adapt to all situations without fine-tuning, and to operate at much larger dimensions; like, don't make it a grid of pixels, make it 4k images at least.
submeta 5 months ago

I pay for lots of models, but Claude Sonnet is the one I use most. ChatGPT is my quick tool for short Q&As because it's got a desktop app. Even Google's new offerings did not lure me away from Claude, which I use daily for hours via a Teams plan with five seats.

Now I am wondering what Anthropic will come up with. Exciting times.
imranq 5 months ago

Based on the chart, the Kaggle SOTA model is far more impressive. These o3 models are more expensive to run than just hiring a Mechanical Turk worker. It's nice we are proving out the scaling hypothesis further, it's just grossly inelegant.

The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost.
slibhb 5 months ago

Interesting about the cost:

> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.
p0w3n3d 5 months ago

We talk a lot about ecology these days. I wonder how much CO2 is emitted during such a task, as an additional cost of the cloud. I'm concerned, because greedy companies will happily replace humans with AI and they will probably plant a few trees to show how they care. But energy does not come from the sun, at least not always and not everywhere... And speaking with an AI customer specialist working for my insurance company, motivated to reject my healthcare bills, is one of the darkest visions of the future...
madsgarff 5 months ago

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

If low-compute Kaggle solutions already do 81%, then why is o3's 75.7% considered such a breakthrough?
bilsbie 5 months ago

Does anyone have prompts they like to use to test the quality of new models?

Please share. I'm compiling a list.
laurent_du 5 months ago
The real breakthrough is the 25% on Frontier Math.
rapjr9 5 months ago

Does anyone have a feeling for how latency (from asking a question / making an API call to getting an answer / API return) is progressing with new models? I see 1.3 minutes/task and 13.8 minutes/task mentioned on the page evaluating o3. Efficiency gains that also reduce latency will be important, and some of them will come from efficiency in computation, but as models include more and more layers (layers of models, for example) the overall latency may grow, and faster compute times inside each layer may only help somewhat. This could have large effects on usability.
usaar333 5 months ago

For what it's worth, I'm much more impressed with the Frontier Math score.
hypoxia 5 months ago

Many are incorrectly citing 85% as human-level performance.

85% is just the (semi-arbitrary) threshold for winning the prize.

o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.

Here's the full breakdown by dataset, since none of the articles make it clear --

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374
devoutsalsa 5 months ago

When the source code for these LLMs gets leaked, I expect to see:

    def letter_count(string, letter):
        if string == "strawberry" and letter == "r":
            return 3
        ...
cambaceres 5 months ago

Can someone explain to me why this is such a big deal? I don't know much about AI, but I'm a software developer with a degree in computer science.
notRobot 5 months ago

Humans can take the test here to see what the questions are like: https://arcprize.org/play
digitcatphd 5 months ago

> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search

This is significant, but I am doubtful it will be as meaningful as people expect, aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.
6gvONxR4sf7o 5 months ago

I'm glad these stats show a better estimate of human ability than just the average mturker. The graph here has the average mturker performance as well as a STEM grad measurement. Stuff like that is why we're always feeling weird that these things supposedly outperform humans while still sucking. I'm glad to see 'human performance' benchmarked with more variety (attention, time, education, etc.).
DiscourseFan 5 months ago

A little from column A, a little from column B.

I don't think this is AGI; nor is it something to scoff at. It's impressive, but it's also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway; a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself; it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.
neom 5 months ago

Why would they give a cost estimate per task for their low compute mode but not their high mode?

"Low compute" mode: uses 6 samples per task, uses 33M tokens for the semi-private eval set, costs $17-20 per task, achieves 75.7% accuracy on the semi-private eval.

"High compute" mode: uses 1024 samples per task (172x more compute), cost data was withheld at OpenAI's request, achieves 87.5% accuracy on the semi-private eval.

Can we just extrapolate roughly $3k per task on high compute? (Wondering if it was withheld because this isn't the case?)
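(The extrapolation is easy to check under the commenter's own assumption that cost scales linearly with compute; nothing here is official pricing.)

    # $17-20 per task in the 6-sample mode, times the reported 172x compute factor:
    low_cost_per_task = (17, 20)
    multiplier = 172
    print([c * multiplier for c in low_cost_per_task])   # [2924, 3440] -> roughly $3k/task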
Havoc 5 months ago

If I'm reading that chart right, that means still log scaling & we should still be good with "throw more power at it" for a while?
ChildOfChaos 5 months ago

This is insanely expensive to run, though. Looks like it cost around $1 million of compute to get that result.

Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test-time compute. It just isn't practical at all; you are not getting this level with a ChatGPT subscription, even the new $200 a month option.
pixelsort 5 months ago

> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models have once again failed us.
nickorlow 5 months ago

Not that I don't think costs will dramatically decrease, but the $1000 cost per task just seems to be for one problem on ARC-AGI. If so, I'd imagine extrapolating that to generating a useful midsized patch would be like 5-10x.

But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation.
Woodi 5 months ago

So the article seriously and scientifically states:

"Our program compilation (AI) gave 90% correct answers in test 1. We expect that in test 2 the quality of answers will degenerate to below the level of a random monkey pushing buttons. Now more money is needed to prove we hit a blind alley."

Hurray! Put a limited version of that on everybody's phones!
parsimo2010 5 months ago
I really like that they include reference levels for an average STEM grad and an average worker for Mechanical Turk. So for $350k worth of compute you can have slightly better performance than a menial wage worker, but slightly worse performance than a college grad. Right now humans win on value, but AI is catching up.
Animats 5 months ago

The graph seems to indicate a new high in cost per task. It looks like they came in somewhere around $5000/task, but the log scale has too few markers to be sure.

That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.

(1995 called. It wants its web design back.)
tripletao 5 months ago

Their discussion contains an interesting aside:

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

So while these tasks get the greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious those outperform human-designed domain-specific approaches.

I wonder to what extent the large improvement comes from OpenAI's training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.
dyauspitr 5 months ago
I wish there was a way to see all the attempts it got right graphically like they show the incorrect ones.
siva7 5 months ago

Seriously, programming as a profession will end soon. Let's not kid ourselves anymore. Time to jump ship.
c1b 5 months ago
How does o3 know when to stop reasoning?
codedokode 5 months ago

I wonder, what is the main obstacle to making robots for mechanical tasks, like laying bricks, paving a road or working in a mine shaft? It doesn't look like something that requires a lot of mathematical or programming skills, just good vision and manipulators.
polskibus 5 months ago

What are the differences between the public offering and o3? What is o3 doing differently? Is it something akin to more internal iterations, similar to "brute forcing" a problem, like you can do yourself with a cheaper model, providing additional hints after each response?
gmerc 5 months ago
Headline could also just be OpenAI discovers exponential scaling wall for inference time compute.
wilg 5 months ago
fun! the benchmarks are so interesting because real world use is so variable. sometimes 4o will nail a pretty difficult problem, other times o1 pro mode will fail 10 times on what i would think is a pretty easy programming problem and i waste more time trying to do it with ai
gxt 5 months ago

I don't care about some scores going up. Newer models need to stop regressing on tasks they were already good at. 4o sucks at LLVM and related tasks, whereas legacy GPT-4 is relatively OK at them.
up2isomorphism 5 months ago

So what do they test? Some matrix in and some matrix out? It does look like "agi" to me.
ghm2180 5 months ago

Wouldn't one then build the analog of the Lisp machine to hyper-optimize just this? It might be super expensive for regular GPUs, but for a super specialized architecture one could shave the $3500/hour quite a bit, no?
freediver 5 months ago

Wondering what the author's thoughts are on the future of this approach to benchmarking? Completing super-hard tasks while failing on 'easy' (for humans) ones might signal measuring the wrong thing, similar to the Turing test.
nprateem 5 months ago

There should be a benchmark that tells the AI its previous answer was wrong and tests the number of times it either corrects itself or incorrectly capitulates, since it seems easy to trip them up when they are in fact right.
tikkun 5 months ago

I wonder: when did o1 finish training, and when did o3 finish training?

There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But it's unclear when o1 and o3 each finished training.
bsaul 5 months ago

I'm surprised there even is a training dataset. Wasn't the whole point to test whether models could show proof of original reasoning beyond pattern recognition?
heliophobicdude 5 months ago

We should NOT give up on scaling pretraining just yet!

I believe that we should explore pretraining video-completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for the GPT series on the text internet, but instead on YouTube, lol. Labeling or augmenting the frames limits scaling the training data.

Imagine using the initial frames or audio to prompt the video-completion model. For example, use the initial frames to write out a problem on a whiteboard, then watch the output generate the next frames with the solution being worked out.

I fear text pairings with CLIP or OCR constrain a model too much and confuse it.
starchild3001 5 months ago

Intelligence comes in many forms and flavors. ARC prize questions are just one version of it -- perhaps measuring more human-like pattern recognition than true intelligence.

Can machines be more human-like in their pattern recognition? o3 met this need today.

While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificially (human-like) intelligent.

What's exciting is that these reasoning models are making significant strides in tackling engineering and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.
ziofill 5 months ago

It's certainly remarkable, but let's not ignore the fact that it still fails on puzzles that are trivial for humans. Something is amiss.
vjerancrnjak 5 months ago
The result on Epoch AI Frontier Math benchmark is quite a leap. Pretty sure most people couldn’t even approach these problems, unlike ARC AGI
cryptoegorophy 5 months ago

Besides higher scores, are there any improvements for general use? Like asking it to help set up Home Assistant, etc.?
the5avage 5 months ago

The examples unsolved by high-compute o3 look a lot like the Raven's progressive matrices used in IQ tests.
YeGoblynQueenne 5 months ago

I guess I get to brag now. ARC-AGI has no real defences against Big Data, memorisation-based approaches like LLMs. I told you so:

https://news.ycombinator.com/item?id=42344336

And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC-AGI:

[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?

[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.
almog 5 months ago

AGI ⇒ ARC-AGI-PUB

And not the other way around, as some comments here seem to confuse necessary and sufficient conditions.
zug_zug 5 months ago

This is a lot of noise around what's clearly not even an order of magnitude of the way to AGI.

Here's my AGI test: can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?

That's what a human-level performer would do.
inoperable5 个月前
Very convenient for OpenAI to run these errands with a bunch of misanthropes trying to repaint a simulacrum. Using the term AGI here makes me want to sponsor a pile of distress pills so people really think things over before going into another manic episode. People seriously need to take a step back; if that's AGI, then my cat has surpassed its cognition twice over.
amai5 个月前
But can it convert handwritten equations into LaTeX? That is the AGI task I'm waiting for.
binarymax5 个月前
All those saying "AGI", read the article and especially the section "So is it AGI?"
viivii295 个月前
Use the three perspectives with mini
cubefox5 个月前
This was a surprisingly insightful blog post, going far beyond just announcing the o3 results.
killjoywashere5 个月前
I just want it to do my laundry.
baalimago5 个月前
Let me know when OpenAI can wrap Christmas gifts. Then I'll be interested.
earth2mars5 个月前
Maybe spend more compute time to let it think about optimizing the compute time.
niemandhier5 个月前
Contrary to many, I hope this stays expensive. We are already struggling with AI-curated info bubbles and psy-ops as it is.
State actors like Russia, the US and Israel will probably be fast to adopt this for information control, but I really don't want to live in a world where the average scammer has access to this tech.
hcwilk5 个月前
I just graduated college, and this was a major blow. I studied Mechanical Engineering and went into Sales Engineering because I love technology and people, but articles like this do nothing but make me dread the future.
I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.
Seems like we're headed toward a world where you either automate someone else's job or are automated yourself.
myrloc5 个月前
What is the cost of "general intelligence"? What is the price?
viivii295 个月前
Use the three perspectives
thatxliner5 个月前
> verified easy for humans, harder for AI
Isn't that the premise behind the CAPTCHA?
jaspa995 个月前
Can it play Mario 64 now?
pal90005 个月前
Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
danielovichdk5 个月前
At what point will it kill us all, having understood that humans are the biggest problem standing between it and simply chilling out without worry?
That would be intelligent. Everything else is just stupid and more of the same shit.
prng20215 个月前
I'm confused about the excitement. Are people just flat-out ignoring the sentences below? I don't see any breakthrough towards AGI here. I see a model doing great on another AI test but about to abysmally fail a variation of it that will come out soon. Also, aren't these comparisons complete nonsense considering it's o3 tuned vs. other models non-tuned?
> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.
> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).
bilsbie5 个月前
When is this available? Which plans can use it?
iLoveOncall5 个月前
It's beyond ridiculous how the definition of AGI has shifted from being an AI so good it can improve itself entirely independently and indefinitely, to "some token generator that can solve puzzles that kids could solve, after burning tens of thousands of dollars".
I spend 100% of my work time on a GenAI project, which is genuinely useful for many users, in a company everyone has heard of, yet I recognize that LLMs are simply dogshit.
Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with, while we plan to replace those agents with deterministic solutions.
This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.
suprgeek5 个月前
Don't be put off by the reported high cost.
Make it possible -> Make it fast -> Make it cheap
The eternal cycle of software.
Make no mistake - we are on the verge of the next era of change.
c1b5 个月前
So o1 pro is CoT RL and o3 adds search?
itfossil5 个月前
The amount of desperate rationalization in this thread is unbelievable. It's like watching people at a Pentecostal church start speaking in tongues in the hope that something wonderful will happen, until it gives way to the realization that shit isn't going to happen, and then they slowly just kind of peter out.
TLDR: The cacophony of fools is so loud now. Thank goodness it won't last.
panabee5 个月前
Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.
But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.
OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.
Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.
Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.
If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.
It will be fascinating to see how this unfolds.
Congrats to OAI on yet another fantastic release.
brcmthrowaway5 个月前
How to invest in this stonk market
thom5 个月前
It’s not AGI when it can do 1000 math puzzles. It’s AGI when it can do 1000 math puzzles then come and clean my kitchen.
behnamoh5 个月前
So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.
kirab5 个月前
FYI: Codeforces competitive programming scores (basically only) by the time needed until valid solutions are posted:
https://codeforces.com/blog/entry/133094
That means this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability and creativity are not rated. It's essentially a "how fast can you make the unit tests pass" kind of competition.
tmaly5 个月前
Just curious, I know o1 is a model OpenAI offers. I have never heard of the o3 model. How does it differ from o1?
TypicalHog5 个月前
This is actually mindblowing!
rimeice5 个月前
Never underestimate a droid
jack_pp5 个月前
AGI for me is something I can give a new project to, and it will then work with that project better than I can. And not because it has a huge context window, but because it will update its weights after consuming that project. Until we have that, I don't believe we have truly reached AGI.
Edit: it also *tests* the new knowledge; it has concepts such as trusting a source, verifying it, etc. If I can just gaslight it into unlearning Python, then it's still too dumb.
Sparkyte5 个月前
Kinda expensive though.
earth2mars5 个月前
Why did they skip o2?
Havoc5 个月前
Did they just skip o2?
epigramx5 个月前
I bet it still thinks 1+1=3 if it has read enough sources parroting that.
dkrich5 个月前
These tests are meaningless until you show them doing mundane tasks.
epolanski5 个月前
Okay but what are the tests like? At least like a general idea.
cchance5 个月前
Is it just me or does looking at the ARC-AGI example questions at the bottom... make your brain hurt?
kittikitti5 个月前
Congratulations
jdefr895 个月前
Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away that the LLM performed well on it? What's that law again? When a benchmark is targeted by some system, the benchmark becomes useless?
theincredulousk5 个月前
Denoting it in $ for efficiency is peak capitalism, cmv.
airstrike5 个月前
Uhh... some of us are apparently living under a rock, as this is the first time I hear about o3 and I'm on HN far too much every day.
sys327685 个月前
So in a few years, coders will be as relevant as cuneiform scribes.
uncomplexity_5 个月前
It's official, old buddy: I'm a has-been.
owenpalmer5 个月前
Someone asked if true intelligence requires a foundation of prior knowledge. This is the way I think about it.
I = E / K
where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.
For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math without using any of the "tricks" that A already knew.
Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K = 0, intelligence is undefined. Tada! I think that answers the question.
Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.
I * K = E
low intelligence * vast knowledge = reasonable effectiveness
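A minimal sketch of that relation in code, with the effectiveness and knowledge values as purely made-up numbers (the scales and figures here are assumptions for illustration, not anything measured):

    # Toy illustration of I = E / K, using made-up numbers.
    def intelligence(effectiveness: float, knowledge: float) -> float:
        """Return E / K; intelligence is undefined when K == 0."""
        if knowledge == 0:
            raise ValueError("undefined when prior knowledge is zero")
        return effectiveness / knowledge

    # Two students solve the same problem equally well (same E),
    # but student A brings more prior knowledge than student B.
    effectiveness = 1.0                  # both got the right answer in the same time
    knowledge_a, knowledge_b = 4.0, 1.0  # arbitrary units of prior knowledge

    print(intelligence(effectiveness, knowledge_a))  # 0.25 -> student A
    print(intelligence(effectiveness, knowledge_b))  # 1.0  -> student B comes out "smarter"

On this toy scale, the hypothetical photographic-memory LLM pairs a tiny I with an enormous K and still lands at a respectable E = I * K.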
duluca5 个月前
The first computers cost millions of dollars and filled entire rooms to accomplish what we would now consider simple computational tasks. That same computing power now fits on a fingernail. I don't get how technologists balk at the cost of experimental tech, or assume current tech will run at the same efficiency for decades to come and melt the planet into a puddle. AGI won't happen until you can fit several data centers' worth of compute into a brain-sized vessel, so the thing can move around and process the world in real time. This is all going to take some time, to say the least. Progress is progress.
__MatrixMan__5 个月前
With only a 100x increase in cost, we improved performance by 0.1x and continued plotting this concave-down, diminishing-returns type of graph! Hurray for logarithmic x-axes!
Joking aside, better than ever before at *any* cost is an achievement; it just doesn't exactly scream "breakthrough" to me.
demirbey055 个月前
It is not exactly AGI, but it is a huge step toward it. I would have expected this step in 2028-2030. I can't really understand why people are happy about it; this technology is so dangerous that it can disrupt whole societies. It's not like the smartphone or the internet. What will happen to third-world countries? Lots of unsolved questions, and the world is not prepared for such a change. Lots of people will lose their jobs, not to mention their debts. No one will have a chance to get rich anymore. If you are in a first-world country you will probably get UBI; if not, you won't.
og_kalu5 个月前
This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).
So much for a plateau lol.
lagrange775 个月前
> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.
That's the most plausible definition of AGI I've read so far.
braden-lk5 个月前
If people constantly have to ask if your test is a measure of AGI, maybe it should be renamed to something else.
maxdoop5 个月前
How much longer can I get paid $150k to write code?
razodactyl5 个月前
Great. Now we have to think of a new way to move the goalposts.
sakopov5 个月前
Maybe I'm missing something vital, but how does anything we've seen AI do up to this point, or anything explained in this experiment, even hint at AGI? Can any of these models ideate? Can they come up with technologies and tools? No, and it's unlikely they will any time soon. However, they can make engineers infinitely more productive.
rvz5 个月前
Great results. However, let's all just admit it.
It has already largely replaced journalists and artists, and it is on its way to replacing both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it, and you know it.
It will only accelerate, and we need to stop pretending and coping. Instead, let's discuss solutions for those lost jobs.
So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)
CliveBloomers5 个月前
Another meaningless benchmark, another month: it's like clockwork at this point. No one's going to remember this in a month; it's just noise. The real test? It's not in these flashy metrics or minor improvements. The only thing that actually matters is how fast it can wipe out the layers of middle management and all those pointless, bureaucratic jobs that add zero value.
That's the true litmus test. Everything else? It's just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.
noah325 个月前
The best AI on this graph costs 50,000% more than a STEM graduate to complete the tasks, and even then has an error rate that is 1,000% higher than the humans???
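For what it's worth, here is the back-of-envelope arithmetic behind percentages like those; the per-task cost and error figures below are placeholder assumptions chosen only to show the calculation, not numbers taken from the report:

    # Hypothetical per-task figures, purely to illustrate the percentage arithmetic.
    ai_cost_per_task = 3000.0   # assumed $ per task for the high-compute model
    human_cost_per_task = 6.0   # assumed $ per task for a STEM-grad solver

    ai_error_rate = 0.12        # assumed ~88% accuracy
    human_error_rate = 0.02     # assumed ~98% accuracy

    cost_increase_pct = (ai_cost_per_task / human_cost_per_task - 1) * 100
    error_increase_pct = (ai_error_rate / human_error_rate - 1) * 100

    print(f"cost: {cost_increase_pct:,.0f}% more")    # ~49,900% with these inputs
    print(f"errors: {error_increase_pct:,.0f}% more") # 500% with these inputs

Whether figures like 50,000% and 1,000% hold up depends entirely on which cost and accuracy numbers you plug in.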
agnosticmantis5 个月前
This is so impressive that it brings out the pessimist in me.
Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy, but it is plausible for the fake-it-till-you-make-it crowd.
Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.
They could also come out in a month and say o3 was so smart it'd endanger civilization, so we deleted the code and saved humanity!