
Arc-AGI-2 and ARC Prize 2025

188 points by gkamradt · about 2 months ago

16 comments

gkamradt · about 2 months ago

Hey HN, Greg from ARC Prize Foundation here.

Alongside Mike Knoop and François Chollet, we're launching ARC-AGI-2, a frontier AI benchmark that measures a model's ability to generalize on tasks it hasn't seen before, and the ARC Prize 2025 competition to beat it.

In Dec '24, ARC-AGI-1 (2019) pinpointed the moment AI moved beyond pure memorization, as seen with OpenAI's o3.

ARC-AGI-2 targets test-time reasoning.

My view is that good AI benchmarks don't just measure progress, they inspire it. Our mission is to guide research towards general systems.

Base LLMs (no reasoning) currently score 0% on ARC-AGI-2. Specialized AI reasoning systems (like R1 or o3-mini) score <4%.

Every ARC-AGI-2 task (100% of them), however, has been solved by at least two humans, quickly and easily. We know this because we tested 400 people live.

Our belief is that once we can no longer come up with quantifiable problems that are "feasible for humans and hard for AI", we effectively have AGI. ARC-AGI-2 proves that we do not have AGI.

Change log from ARC-AGI-1 to ARC-AGI-2:

* The two main evaluation sets (semi-private and private eval) have increased to 120 tasks
* Solving tasks requires more reasoning vs. pure intuition
* Each task has been confirmed to have been solved by at least 2 people (many more in most cases) out of an average of 7 test takers, in 2 attempts or fewer
* Non-training task sets are now difficulty-calibrated

The 2025 Prize ($1M, open source required) is designed to drive progress on this specific gap. Last year's competition (also launched on HN) had 1.5K teams participate and 40+ research papers published.

The Kaggle competition goes live later this week and you can sign up here: https://arcprize.org/competition

We're in an idea-constrained environment. The next AGI breakthrough might come from you, not a giant lab.

Happy to answer questions.
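The scoring rule described above (a task counts as solved only if one of at most two attempts reproduces the expected output grid exactly) can be sketched in a few lines. This is an illustrative reconstruction, not the official harness; the function names and toy grids are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[int]]  # ARC grids are small 2-D arrays of color indices (0-9)

def attempt_solves(attempts: List[Grid], expected: Grid) -> bool:
    """Exact-match scoring: some attempt (at most 2 count) must equal the target, cell for cell."""
    return any(a == expected for a in attempts[:2])

def score(tasks: List[Tuple[List[Grid], Grid]]) -> float:
    """Fraction of tasks solved, as reported on the leaderboard."""
    solved = sum(attempt_solves(attempts, expected) for attempts, expected in tasks)
    return solved / len(tasks)

# Toy demonstration with 2x2 grids:
target = [[1, 0], [0, 1]]
tasks = [
    ([[[1, 0], [0, 1]]], target),                    # solved on the first attempt
    ([[[0, 0], [0, 0]], [[1, 1], [1, 1]]], target),  # both attempts wrong
]
print(score(tasks))  # 0.5
```

Under this rule there is no partial credit within a task, which is why the 0% and <4% figures quoted for base and reasoning LLMs are so stark.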
danpalmer · about 2 months ago

> and was the only benchmark to pinpoint the exact moment in late 2024 when AI moved beyond pure memorization

This is self-referential: the benchmark pinpointed the moment AI went from memorization to problem solving because the benchmark requires problem solving to complete. How do we know it requires problem-solving skills? Because memorization-only LLMs can't do it but humans can.

I think ARC are producing some great benchmarks, and they probably are pushing forward the state of the art; however, I don't think they identified anything in particular with o3, at least they don't seem to have proven a step change.
falcor84 · about 2 months ago

I spent half an hour playing with these now at https://arcprize.org/play and it's fun, but I must say they are not "easy". So far I have eventually solved all of the ones I've gone through, but several took me significantly more than the 2 tries allotted.

I wonder whether this can be shown to be a valid IQ test, and if so, what IQ a person would need to solve, e.g., 90% of them in 1 or 2 tries.
iandanforth · about 2 months ago

I'd very much like to see VLAs get in the game with ARC. When I solve these puzzles I'm imagining myself moving blocks around. Much of the time I'm treating them as physics simulations with custom physics per puzzle. VLAs are particularly well suited to the kind of training and planning that might unlock solutions here.
fastball · about 2 months ago

I don't know if this was a design goal, but I just did the first 10 ARC-AGI-2 public eval (hard) puzzles and found them much more enjoyable (as a human) than any of the ARC-AGI-1 puzzles. That said, the grid/puzzle editor is still a little clunky; it would be nice to be able to drag-to-paint and have an adjustable brush size.
neom · about 2 months ago

Maybe this is a really stupid question, but I've been curious: are LLMs based on "neuronormativity"? Like, what neurology is an LLM based on? Would we get any benefit from looking at neurodiverse processing styles?
artificialprint · about 2 months ago

Oh boy! Some of these tasks are not hard, but they require full attention and a lot of counting just to get things right! Perhaps ARC3 will go 3D? JK.

Congrats on the launch; let's see how long it takes to get saturated.
Nesco · about 2 months ago

At first glance, it's like ARC 1 with some structures serving as contextual data, plus more complicated symmetries / topological transformations.

Now I wonder what surprises are to be found in the full dataset.

The focus on solving discrete tasks cost-efficiently might actually lead us towards deep learning systems that can be used reliably in production, rather than ones that just give a wow effect or need to be constantly supervised.
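For readers unfamiliar with the task format: ARC grids are small 2-D arrays of color indices, and the symmetries mentioned in the comment above are transformations such as rotations and reflections of those grids. A minimal illustrative sketch (helper names are hypothetical, not from the dataset):

```python
from typing import List

Grid = List[List[int]]

def rotate90(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]

def reflect_h(g: Grid) -> Grid:
    """Mirror the grid left-to-right."""
    return [row[::-1] for row in g]

g = [[1, 2],
     [3, 4]]
print(rotate90(g))   # [[3, 1], [4, 2]]
print(reflect_h(g))  # [[2, 1], [4, 3]]
```

ARC-AGI-2 tasks typically compose several such transformations and condition them on contextual structures in the grid, which is what makes them resistant to pattern-matching alone.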
ipunchghosts · about 2 months ago

The computer vision community needs a dataset like this for evaluation: train in one domain and test in another. The best we have now are the ImageNet-R and ImageNet-C datasets. Humans have no issue with domain adaptation in vision, but computer vision models still struggle in many ways, including with out-of-domain images.
Davidzheng · about 2 months ago

OpenAI will probably be >60% within three months, if not immediately, at this $1000-per-question level of compute (which, tbh, is the way: we should throw compute at problems whenever possible; that's the main advantage of silicon intelligence).
nneonneo · about 2 months ago
Nitpick: “Public” is misspelled as “pubic” in several of the captions on that page.
momojo · about 2 months ago

Have you had any neurologists use your dataset? My own reaction after solving a few of the puzzles was: "Why is this so intuitive for me, but not for an LLM?"

Our human ability to abstract things is underrated.
FergusArgyll · about 2 months ago

I'd love to hear from the ARC guys:

These benchmarks, and specifically the constraints placed on solving them (compute etc.), seem to me to incentivize the *opposite* of "general intelligence".

Have any of the technical contributions used to win the past competition been used to advance *general* AI in any way?

We have transformer-based systems constantly gaining capabilities. On the other hand, have any of the Kaggle submissions actually advanced the field in any way outside of the ARC Challenge?

To me (a complete outsider, admittedly) the ARC prize seems like an operationalization of the bitter lesson.
lawrenceyan · about 2 months ago

Concrete benchmarks like these are very valuable.

Defining the reward function, which is basically what ARC is doing, is 50% of the problem-solving process.
ttol · about 2 months ago

Had to give https://reasoner.com a try on ARC-AGI-2.

Reasoner passed on the first try.

"Correct!"

(See the screenshot showing one rated "hard": https://www.linkedin.com/posts/waynechang_tried-reasoner-on-arc-prizes-just-released-activity-7310115134092312576-NGQr)
jwpapi · about 2 months ago
Did we run out of textual tasks that are easy for humans but hard for AI, or why are the examples all graphics?