
ARC-AGI without pretraining

351 points by georgehill, 3 months ago

12 comments

pona-a, 3 months ago
I feel like extensive pretraining goes against the spirit of generality.

If you can create a general machine that can take 3 examples and synthesize a program that predicts the 4th, you've just solved oracle synthesis. If you train a network on all human knowledge, including puzzle making, and then fine-tune it on 99% of the dataset and give it a dozen attempts for the last 1%, you've just made an expensive compressor for test-maker's psychology.
AIorNot, 3 months ago
I was thinking about this Lex Fridman podcast with Marcus Hutter. Also, Joscha Bach defined intelligence as the ability to accurately model reality. Is lossless compression itself intelligence, or a best-fit model? Is there a difference? https://www.youtube.com/watch?v=E1AxVXt2Gv4
cocomutator, 3 months ago
I'm trying to distill the essence of their approach, which imho is concealed behind inessential and particular details such as the choice of this or that compression scheme or prior distributions.

It seems like the central innovation is the construction of a "model" which can be optimized with gradient descent, and whose optimum is the "simplest" model that memorizes the input-output relationships. In their setup, "simplest" has the concrete meaning of "which can be efficiently compressed", but more generally it probably means something like "whose model complexity is lowest possible".

This is in stark contrast to what happens in standard ML: typically, we start by prescribing a complexity budget (e.g. by choosing the model architecture and all complexity parameters), and only then train on data to find a good solution that memorizes the input-output relationship.

The new method is ML on its head: we optimize the model so that we reduce its complexity as much as possible while still memorizing the input-output pairs. That this is able to generalize from 2 training examples is truly remarkable and imho hints that this is absolutely the right way of "going about" generalization.

Information theory happened to be the angle from which the authors arrived at this construction, but I'm not sure that is the essential bit. Rather, the essential bit seems to be the realization that rather than finding the best model for a fixed pre-determined complexity budget, we can find models with minimal possible complexity.
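The contrast can be sketched in a few lines. This toy is a hypothetical illustration, not the authors' architecture: a one-parameter linear model whose training loss charges both for fitting the examples and for the description length of the weight itself, with an L2 term as a crude proxy for encoding cost and an assumed constant `lam`.

```python
def predict(w, x):
    # Toy one-parameter "model" standing in for the decoder network.
    return w * x

def description_length(w, lam=0.5):
    # Crude stand-in for the bits needed to encode w; an L2 penalty
    # approximates this encoding cost up to a constant. lam is assumed.
    return lam * w ** 2

def objective(w, data):
    # Fit the input-output pairs AND shrink model complexity, instead of
    # fixing a complexity budget up front and only then fitting.
    fit = sum((predict(w, x) - y) ** 2 for x, y in data)
    return fit + description_length(w)

data = [(1.0, 1.0), (2.0, 2.0)]  # two training examples
print(objective(1.0, data))  # fit is exact; only the complexity term (0.5) remains
```

Gradient descent on this objective would drive the model toward the least complex parameters that still reproduce the examples, which is the "ML on its head" picture above.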
EigenLord, 3 months ago
Interesting. I've been slowly coming to the conclusion that the way forward with machine learning is actually less "machine learning" as we've grown accustomed to it: less pretraining, less data, less search; more direct representation, symbolic processing, constraint satisfaction, meta-learning, etc. All those things we need less of (pretraining, data, etc.) are messy, brute-force, and contingent. Working with them, you'll always be dependent on the quality of your data, which is fine if you want to data-mine, but not fine if you want to model the *underlying causes of the data*.

From my (admittedly sketchy, rushed) understanding of what they're doing, they're essentially trying to uncover the minimal representation of the solution/problem space. Through their tracking of the actual structure of the problem through equivariances, they're actually deriving something like the actual underlying representation of the puzzle and how to solve it, rather than hoping to pick up on this from many solved examples.
pyinstallwoes, 3 months ago
Impressive documentation and explanation. Thank you.

It pleases me to find this as it supports my own introspection (heck, it's in my profile!)

> Intelligence is compressing information into irreducible representation.
d--b, 3 months ago
> ARC-AGI, introduced in 2019, is an artificial intelligence benchmark designed to test a system's ability to infer and generalize abstract rules from minimal examples. The dataset consists of IQ-test-like puzzles, where each puzzle provides several example images that demonstrate an underlying rule, along with a test image that requires completing or applying that rule. While some have suggested that solving ARC-AGI might signal the advent of artificial general intelligence (AGI), its true purpose is to spotlight the current challenges hindering progress toward AGI.

Well, they kind of define intelligence as the ability to compress information into a set of rules, so yes, compression does that…
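For concreteness, a puzzle of this shape can be mocked up in a few lines. The grids and the hidden rule here (mirror each row left-to-right) are invented for illustration and are not from the actual dataset.

```python
# Toy ARC-style puzzle: integer grids plus a hidden rule demonstrated
# by a few input -> output examples. The rule is an invented stand-in.
examples = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3], [0, 4]], [[3, 3], [4, 0]]),
]
test_input = [[5, 0], [0, 6]]

def apply_rule(grid):
    # Mirror each row left-to-right.
    return [list(reversed(row)) for row in grid]

# A solver must infer apply_rule from the examples alone, then apply it
# to the test input.
assert all(apply_rule(inp) == out for inp, out in examples)
print(apply_rule(test_input))  # [[0, 5], [6, 0]]
```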
fragebogen, 3 months ago
(Somewhat) related: Schmidhuber, https://arxiv.org/abs/0812.4360
naasking, 3 months ago
> Despite these constraints, CompressARC achieves 34.75% on the training set and 20% on the evaluation set—processing each puzzle in roughly 20 minutes on an RTX 4070.

This phrasing suggests that each puzzle took 20 mins, so for the 100-puzzle challenge that's 33.3 hours, which exceeds the target of 12 hours for the challenge. Pretty cool approach though.
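The runtime arithmetic is easy to verify:

```python
# 100 puzzles at ~20 minutes each, vs. the challenge's 12-hour budget.
puzzles = 100
minutes_each = 20
total_hours = puzzles * minutes_each / 60
print(round(total_hours, 1))  # 33.3
print(total_hours > 12)       # True: over budget by ~21 hours
```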
unixpickle, 3 months ago
This seems to be pretty much exactly a standard Bayesian deep learning approach, albeit with a heavily engineered architecture.
cgadski, 3 months ago
I'm very excited that we're figuring out how to use deep learning on small numbers of data points!

I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.

About their overall method, they write:

> It isn't obvious why such a method is performing compression. You'll see later how we derived it from trying to compress ARC-AGI.

I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):

> We don't use it. Maybe it matters, but we don't know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.

So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains the success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.

Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.

Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:

> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.

As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel: at least one of them needs to break the symmetry.

The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
programjames, 3 months ago
Here's what they did:

1. Choose random samples z ~ N(μ, Σ) as the "encoding" of a puzzle, and a distribution of neural network weights p(θ) ~ N(θ, <very small variance>).

2. For a given z and θ, you can decode to get a distribution of pixel colors. We want these pixel colors to match the ones in our samples, but they're not guaranteed to, so we'll have to add some correction ε.

3. Specifying ε takes KL(decoded colors || actual colors) bits. If we had sources of randomness q(z), q(θ), specifying z and θ would take KL(p(z) || q(z)) and KL(p(θ) || q(θ)) bits.

4. The authors choose q(z) ~ N(0, 1), so KL(p(z) || q(z)) = 0.5(μ^2 + Σ^2 - 1 - 2 ln Σ). Similarly, they choose q(θ) ~ N(0, 1/2λ), and since Var(θ) is very small, this gives KL(p(θ) || q(θ)) = λθ^2.

5. The fewer bits they use, the lower the Kolmogorov complexity, and the more likely it is to be correct. So, they want to minimize the number of bits

   a * 0.5(μ^2 + Σ^2 - 1 - 2 ln Σ) + λ * θ^2 + c * KL(decoded colors || actual colors).

6. Larger a gives a smaller latent, larger λ gives a smaller neural network, and larger c gives a more accurate solution. I think all they mention is that they choose c = 10a, and that λ was pretty large.

They can then train μ, Σ, θ until it solves the examples for a given puzzle. Decoding will then give all the answers, including the unknown answer! The main drawback to this method is that, like Gaussian splatting, they have to train an entire neural network for every puzzle. But the neural networks are pretty small, so you could train a "hypernetwork" that predicts μ, Σ, θ for a given puzzle, and even predicts how to train these parameters.
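Steps 3-5 above can be turned into a small numeric sketch. Scalar μ and Σ stand in for the full latent, the closed-form Gaussian KL is the one from step 4, and the constants a and lam are placeholder assumptions (the comment only fixes c = 10a and says λ was large).

```python
import math

def kl_to_std_normal(mu, sigma):
    # KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (mu^2 + sigma^2 - 1 - 2 ln sigma),
    # as in step 4.
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - 2.0 * math.log(sigma))

def description_bits(mu, sigma, theta, color_kl, a=1.0, lam=1.0):
    # Total description length from step 5:
    # latent cost + weight cost + correction cost.
    # a and lam are assumed values; the comment only fixes c = 10 * a.
    c = 10.0 * a
    latent = a * kl_to_std_normal(mu, sigma)
    weights = lam * sum(t ** 2 for t in theta)
    correction = c * color_kl
    return latent + weights + correction

# With mu = 0, sigma = 1 the latent is free (KL = 0); only the weight
# and correction terms contribute.
print(description_bits(0.0, 1.0, theta=[0.1, -0.2], color_kl=0.3))
```

Training would adjust μ, Σ, θ by gradient descent on this total until decoding reproduces the example grids, at which point the same decode yields the held-out answer.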
YeGoblynQueenne, 3 months ago
>> The idea that efficient compression by itself lies at the heart of intelligence is not new (see, e.g., Hernández-Orallo & Minaya-Collado, 1998; Mahoney, 1999; Hutter, 2005; Legg & Hutter, 2007). Rather than revisiting those theoretical discussions, we make a practical demonstration instead.

There we go again. Claim: compression <something something> intelligence. Evidence: 34.75% on ARC-AGI.

Like Carl Sagan once pointed out: "Observation: You couldn't see a thing. Conclusion: dinosaurs."

https://www.youtube.com/watch?v=w_N_IYi2c0E&themeRefresh=1