
LIMO: Less Is More for Reasoning

389 points by trott · 3 months ago

25 comments

highfrequency · 3 months ago
Cool result, but worth highlighting two points:

- Model is finetuned from Qwen-2.5 Instruct, which includes millions of specially filtered math examples in both pretraining and supervised fine-tuning already.

- To generate the perfect 817 math examples for LIMO, they used state-of-the-art models like R1 to filter down from an initial pool of *10 million* math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It's not very clear to me whether this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.
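To make the scale of that curation concrete, here is a minimal sketch of what such a filtering loop could look like. The judge function, score fields, and thresholds are illustrative assumptions, not the paper's actual pipeline; in the paper the judging role is played by strong models like R1.

    # Hypothetical sketch: distill a huge problem pool down to a small,
    # maximally informative finetuning set using a strong "judge" model.
    import random
    from dataclasses import dataclass

    @dataclass
    class Problem:
        statement: str
        solution: str

    def judge(problem: Problem) -> dict:
        # Placeholder for querying a strong reasoning model (R1-class).
        # Random scores here just so the sketch runs end to end.
        return {"difficulty": random.random(), "chain_quality": random.random()}

    def curate(pool: list[Problem], k: int = 817) -> list[Problem]:
        survivors = []
        for p in pool:
            s = judge(p)
            # Keep only hard problems with clean, verifiable reasoning chains.
            if s["difficulty"] > 0.8 and s["chain_quality"] > 0.9:
                survivors.append((s["difficulty"] * s["chain_quality"], p))
        # Take the top-k survivors as the finetuning set: e.g. 10M in, 817 out.
        survivors.sort(key=lambda t: t[0], reverse=True)
        return [p for _, p in survivors[:k]]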
hexomancer · 3 months ago
Here is how I make sense of it (I have no expertise in this subject, so please correct me if I am wrong). When the model is pretrained on the internet, it gains most of the skills required for mathematical reasoning. However, since its task is to predict the next-word distribution of the entire internet, it does not normally use this ability, because most text on the internet is not this type of reasoning text. Think of generative image models a few years ago, where appending "unreal engine" to a prompt would significantly improve output quality: the model was trained to reproduce the distribution of images on the internet, most of which are not particularly impressive, but since images containing "unreal engine" were usually high-quality screenshots, the phrase moved the distribution of generated images toward higher-quality generations. So I think the model already has most of the ability; it just needs to adjust a few connections to actually utilize this latent skill, and it makes sense that a few training examples are enough to adjust those connections and increase mathematical reasoning skill.
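A toy way to see the conditioning argument (all numbers invented, nothing from the paper): conditioning on a rare tag can select a much better sub-distribution than the marginal.

    # Toy illustration: the marginal "internet" quality is mediocre, but
    # conditioning on the "unreal engine" tag selects a high-quality subset.
    corpus = [
        {"tag": "unreal engine", "quality": 0.90},
        {"tag": "unreal engine", "quality": 0.95},
        {"tag": "random snapshot", "quality": 0.30},
        {"tag": "random snapshot", "quality": 0.40},
        {"tag": "random snapshot", "quality": 0.50},
    ]

    def mean_quality(samples):
        return sum(s["quality"] for s in samples) / len(samples)

    print(mean_quality(corpus))  # marginal: 0.61
    print(mean_quality([s for s in corpus
                        if s["tag"] == "unreal engine"]))  # conditioned: 0.925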
easeout · 3 months ago
My guess at the upshot: some domains, like math, are general but have outsized effective vocabularies (e.g., all possible numbers), which makes them more expensive to train by the same method that works for domains with regular-sized vocabularies. If you train on reasoning steps in such a domain, you can reinforce the comparatively few general terms of the vocabulary, like "add", "inverse", "solve". That keeps the arithmetic of number combinations separate from particular problems, because you're not emphasizing one-shot answers. You can train N reasoning cases + M arithmetic cases instead of N*M whole math problems. So you have to use more inference power, but you can get better answers for less training.

Theory aside, I would think a good application-side method is to use this general reasoning process to structure a final expression and then pass that through a traditional evaluator, as sketched below. Then the reasoning, and the training thereof, need only go as far as symbol manipulation. This is something like Wolfram Alpha, if its NLP handed off to the evaluator much later in the process.
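A minimal sketch of that hand-off, assuming the model's output is just a symbolic equation string and SymPy plays the traditional evaluator. The `llm_structure` stub is hypothetical and hard-coded so the sketch runs:

    # The LLM structures the problem; a conventional CAS evaluates it.
    import sympy as sp

    def llm_structure(problem: str) -> str:
        # Hypothetical stand-in for the model call: translate the word
        # problem into an equation string.
        return "Eq(3*x + 5, 20)"

    def answer(problem: str):
        x = sp.symbols("x")
        # The evaluator, not the LLM, does the exact symbol manipulation,
        # so arithmetic accuracy no longer rides on token prediction.
        equation = sp.sympify(llm_structure(problem), locals={"x": x})
        return sp.solve(equation, x)

    print(answer("Three times a number plus five is twenty."))  # [5]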
igleria · 3 months ago
I think I've recently read two seemingly contradicting things:

1- LLMs can never generalize theorem proving

2- this paper: "This suggests that contemporary LLMs may already possess rich mathematical knowledge in their parameter space, transforming the challenge from knowledge acquisition to knowledge elicitation"

Not sure what is what anymore!
doug_durham · 3 months ago
In the same way that image diffusion models showed that convincing approximations of the entire visual world could be summarized in a 5GB model, are "reasoning patterns" similarly compressible? Are there actually countably few reasoning patterns that are used across all domains, and as such can be captured with relatively small training sets?
guyomes · 3 months ago
I wonder if their curated set of 817 math problems is also useful as teaching material for training math students on a diverse set of problems.
Limoynada · 3 months ago
If the LIMO hypothesis is true (that small models have a latent capacity for efficient reasoning which can be elicited by finetuning on a small dataset), then we could see a huge transfer of capability from huge models to small models, and doing that recursively seems to offer unlimited power. But to feed that loop, the datasets would need a particular property: they would have to teach the model to adapt its reasoning to its size, verified by the model extending the depth of its reasoning chain while keeping a small branching factor in the exploration space, like a minimum cover for detecting deep patterns.
sega_sai · 3 months ago
It is interesting how the field is becoming 'pedagogy of LLMs'.
akomtu · 3 months ago
Reasoning is the art of prediction. Reasoning is distilling many observations of reality into a tiny model of reality that predicts new observations well enough. "What's the simplest model that explains most of what I'm seeing?" is the main question our mind tries to answer. When the art of creating such models is mastered, we pattern-match new problems to our models and use them to predict the outcome.
fpgaminer · 3 months ago
I noticed a similar phenomenon in my work on JoyCaption when I began teaching it VQA. JoyCaption was trained on about 800k image-caption pairs, and built from so400m and Llama 3.1 8B Instruct. There's no VQA data in its training.

As an experiment, I hand-built a VQA dataset of ~600 examples, which is a vanishingly small number compared to even rudimentary VQA datasets (which tend to be about 10k examples or more). However, I ensured that the dataset was broad and highly varied, and that the queries aggressively exercised both visual and textual understanding.

With only 600 training examples, I finetuned the base JoyCaption model in a handful of minutes, and to my surprise, not only did it gain VQA abilities, it's able to generalize quite far outside of its training set, even to concepts not in the original 800k caption data.

My hypothesis is that if the training data is varied enough, it forces the model to generalize. It isn't given enough examples of any given type of task to learn specialized circuitry for them, so its only option is to learn a broadly generalized set of circuitry. The data keeps it on its toes, so to speak.

Of course, this leans heavily on Llama's existing instruction (text-based) tuning, so it's starting off on good footing there. The surprising bit is being able to generalize so well to a new domain (vision) with so little data.

One caveat is that this model is highly unstable, and the accuracy of its responses is much worse than the accuracy of the base model. It's able to handle all of the tasks I've tested on it, but often requires a few retries to get it right.

Building these datasets is also tedious and intensive. I've yet to successfully train existing AIs to generate useful user queries/instructions/questions, either through prompting or finetuning. So it all has to be done by hand. And every answer was either written by me, or generated by an existing VLM and then edited by me to ensure perfect accuracy and adherence to the request. Since the queries are complex and challenging, this makes the work of writing those answers similarly challenging and time-consuming.

As an aside: this training also seems to have broken Llama's alignment. I've had it be remarkably sassy in its responses, and it's much better at simulating more normal human responses.
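For flavor, a minimal sketch of this kind of tiny-dataset finetune in a Hugging Face-style setup. This is a text-only stand-in (the actual model pairs a so400m vision tower with Llama 3.1 8B); the model name, data path, and hyperparameters are illustrative assumptions, not JoyCaption's recipe.

    # Hypothetical sketch: LoRA-finetune an instruct model on ~600 examples.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
    tokenizer = AutoTokenizer.from_pretrained(name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(name)

    # A small low-rank update: with ~600 varied examples there is little
    # room to learn per-task circuitry, which may be what forces generality.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"]))

    # Assumed format: one JSON object per line with "query" and "answer".
    data = load_dataset("json", data_files="vqa_600_examples.jsonl")["train"]
    data = data.map(lambda ex: tokenizer(ex["query"] + "\n" + ex["answer"],
                                         truncation=True, max_length=1024))

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=4,
                               learning_rate=1e-4),
        train_dataset=data,
        # mlm=False gives a causal-LM objective (labels = input_ids).
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()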
tw1984 · 3 months ago
With really high-quality samples, the reasoning ability of a well-trained LLM can be activated with a very small number of SFT samples; that is what I learned from the paper. It is an interesting finding but not practical though, as you first need a far more capable reasoning model (R1 in this case) to get those 817 high-quality samples. DeepSeek-R1-Distill-Qwen-32B has better reasoning skills according to the same benchmarks.

Another trend I've noticed is that there are already 3 papers reporting similar findings using Qwen-2.5-Instruct. Did they find something interesting about LLMs, or something unique to Qwen-2.5-Instruct? I guess we need more experimental results to draw conclusions.
1R053 · 3 months ago
I think the title of the paper is misleading. The result obviously shows impressive performance with just a few training examples. However, I cannot see that, holding the method fixed, reducing training data leads to more performance. They have simply (and impressively) shifted the performance curve to lower thresholds. Even with this new method, more training data should still give better results. It would be interesting to see a full performance curve for the method as a function of training data amount (and potentially quality).
ak_111 · 3 months ago
It's actually difficult for non-Chinese readers to work out the affiliations of the authors. SJTU = Shanghai Jiao Tong University, but I couldn't work out GAIR or IIS.
elif · 3 months ago
So it sounds like we should have schizophrenic AIs that alternate and collaborate between specialized, domain-specific submodels. I guess the number of submodels does not cost compute, so it can grow quite large, and if each of these models is as reduced as in this paper, the overall compute cost should drop substantially.
fallmonkey · 3 months ago
While there are interesting findings here, https://arxiv.org/pdf/2502.03373 (also with a lot of good findings) suggests a somewhat contradictory theory about the critical mass of training process/data needed for reasoning capability.
antirez · 3 months ago
The S1 paper did basically the same a few days ago: 1000 total CoTs with SFT.

I believe all this shows that the pre-training stage already creates the representations needed for CoT reasoning, so they are very simple to uncover, either with R1-Zero's pure RL or with few-shot SFT.
xendo · 3 months ago
Any idea if the same dataset can be used to improve human reasoning? Say I manually analyze the 817 math examples: would that be an optimal strategy for me to improve my math reasoning? Can the same distillation process be applied to LeetCode?
fabmilo · 3 months ago
I will believe in reasoning architectures when the model knows how to store parametric information in an external memory outside of the training loop.
delichon · 3 months ago

    To see a World in a Grain of Sand
    And a Heaven in a Wild Flower,
    Hold Infinity in the palm of your hand
    And Eternity in an hour.
yalok · 3 months ago
This makes me wonder whether there's similar research on reducing the amount of data (by improving its quality) for pretraining.
ysofunny · 3 months ago
Where's chatbotAI-zero, in the way AlphaGo Zero became the best after training with itself (and only with itself)?
aymaneSennoussi · 3 months ago
I'm confused. This looks like a distillation of Qwen for math problems. What am I missing?
ei625 · 3 months ago
People here should read the paper, especially (1) how they build smaller datasets and (2) how the reasoning process is categorized into L1-L5 during evaluation.
emorning3 · 3 months ago
My conclusion from all that I'm reading lately is that LLMs cannot do deduction, but they can fake it real good.

I mean, you wouldn't use this brand of AI to plot your path to Mars. Well, you could, BUT you'll also want to validate the path or risk dying.

But this AI is good enough for Elon and his ilk. Because Elon's not gonna get into the capsule, you are.

Because you are not the master of this AI, you are the validator.
shashanoid · 3 months ago
Love prepending 'explain' to arxiv links these days xD https://explainarxiv.org/abs/2502.03387