What the hell is going on this week?!?!? (asking positively, with a smile on my face)

I have seen at least 3 interesting/mildly promising breakthroughs in ML just these past two days! I mean, a Google research team just discovered that you can combine NNs with cellular automata using digital logic gates as a medium, so you could potentially reduce many kinds of non-linear problems to a simple, efficient digital circuit! And it was on the HN front page, TODAY! [1]

I keep seeing more mind-bending stuff related to neural nets and logic/intelligence in general. My mind has been running wild with speculation about the future and just how close we could (or could not) be to truly understanding how intelligence works from first principles.

[1] https://news.ycombinator.com/item?id=43286161
Reminds me of a quote by the famous number theorist Hendrik Lenstra:

"For every problem you can't solve, there's a simpler problem that you also can't solve."
Their test-time RL approach seems a bit fishy. From what I understand, TTRL works by asking a language model to generate simpler versions of the test case. Once we have the simpler problems, we run RL on them, hoping that an improvement on the simplified cases will also improve the model's performance on the original problem.

The issue is, they use a numerical integrator to verify the simpler problems. One could imagine a scenario where a barely simpler problem is generated, and the model is then allowed to train on pretty much the test case while knowing the ground truth. Seems like training on the test set.

The rest of the paper is nice though.
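To make the leakage worry above concrete, here's a rough sketch in Python. This is not the paper's code: `generate_variant` is a hypothetical stand-in for the LLM's "make it simpler" step, and only the use of numerical quadrature as the verifier comes from the paper. If the generated variant is only cosmetically different from the test problem, the verified reward on the variant is effectively the answer to the test problem itself:

    # Sketch of the leakage concern, not the paper's implementation.
    from scipy.integrate import quad
    import numpy as np

    test_integrand = lambda x: x * np.exp(x**2)   # a hypothetical held-out test problem

    def generate_variant(f):
        # Stand-in for the LLM's "generate a simpler version" step. Nothing
        # prevents a near-identical variant, e.g. the same integrand scaled by 2.
        return lambda x: 2.0 * f(x)

    def reward(candidate_value, integrand, a=0.0, b=1.0):
        # The verifier used during test-time RL: compare the model's claimed
        # value of the definite integral against numerical quadrature.
        truth, _ = quad(integrand, a, b)
        return float(abs(candidate_value - truth) < 1e-6)

    variant = generate_variant(test_integrand)
    v_truth = quad(variant, 0, 1)[0]
    t_truth = quad(test_integrand, 0, 1)[0]
    print(v_truth, 2 * t_truth)       # identical: the variant's "truth" leaks the test answer
    print(reward(v_truth, variant))   # 1.0 -- memorizing this gets full reward on the variant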
> We demonstrate LADDER's effectiveness in the subject of mathematical integration, improving Llama 3.2 3B's accuracy from 1% to 82% on undergraduate-level problems
Test-time training/RL is definitely the right approach for math AI in the future. It is probably one of only a few ways to spend obscene amounts of compute on a given problem (think 10^5 GPUs for a few days), and it has hope of making progress where test-time inference scaling may not at first (think of trying to do MCTS on a Go position with a bad value/policy net). AlphaProof already did this, but nice to see it done again--good results!
Sidenote: `Tufa Labs` team includes the `MindsAI` team of ARC-AGI fame.
<a href="https://tufalabs.ai/team.html" rel="nofollow">https://tufalabs.ai/team.html</a>
Some names are just too tempting: https://arxiv.org/abs/1507.02672
At the end of the paper they mention "two problems from the 2025 MIT Integration Bee qualifying exam which the system continued to answer incorrectly".

They say the questions were among the most complex questions on the exam, but the first one is just

    ∫ ∛(x · ∜(x · ⁵√(x · ⁶√(x · ⋯)))) dx

which just requires you to compute the exponent

    1/3 + 1/(3*4) + 1/(3*4*5) + ...

So hardly very advanced math.
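For what it's worth, a quick numeric check of that sum (assuming the root indices in the nested radical really do go 3, 4, 5, 6, ..., which is what the series implies):

    # Sanity check: the exponent of x is 1/3 + 1/(3*4) + 1/(3*4*5) + ...
    # Since 3*4*...*n = n!/2, each term is 2/n!, so the sum is 2e - 5.
    import math

    total, prod = 0.0, 1.0
    for n in range(3, 20):        # terms shrink factorially; 20 terms is plenty
        prod *= n                 # prod = 3 * 4 * ... * n
        total += 1.0 / prod

    print(total, 2 * math.e - 5)  # both ~0.4365637
    # so the integrand is x**(2e-5) and the answer is x**(2e-4) / (2e-4) + C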
That this works at all is pretty interesting. That it seems to work very well with math is quite interesting.

That said, this paper is part of the move we're seeing right now toward blurring the line between training and inference -- part of their method involves doing reinforcement learning on questions they don't know the answer to but can decompose into simpler questions, and using GRPO on those with a numerical 'checker'. This reinforced model can then answer more questions.

I like this. I think humans do this a lot: mulling on something, turning it over in their heads, analogizing, etc. Adding test-time training is a way to do a lot more thinking than adding tokens to the context of a fixed model at inference.

Just as DeepSeek and o1/o3 show that we can increase capacity with inference-time token generation and assessment, it looks like we can increase capacity with inference-time automated fine-tuning as well.

I'd hope that as these techniques solidify we'll have a new way to talk and think about this -- they are all part of the same fundamental process at some level.

Either way, super cool.
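As an aside, here's a hand-wavy toy illustration of what "GRPO with a numerical checker" could look like. This is my own sketch, not the authors' code: the candidate answers are made up, and only the two ingredients named above -- a numerical verifier and group-relative rewards -- are taken from the description:

    # Toy sketch: score a group of sampled answers with a numerical checker,
    # then turn the scores into group-relative advantages (the GRPO idea).
    import numpy as np
    from scipy.integrate import quad

    integrand = lambda x: np.cos(x)                  # a generated sub-problem

    candidates = {                                   # imagined model samples
        "sin(x)":  lambda x: np.sin(x),
        "-sin(x)": lambda x: -np.sin(x),
        "x":       lambda x: x,
    }

    def checker_reward(antideriv, f, a=0.0, b=1.0):
        # Reward 1 if the proposed antiderivative matches numerical quadrature.
        truth, _ = quad(f, a, b)
        return 1.0 if abs((antideriv(b) - antideriv(a)) - truth) < 1e-6 else 0.0

    rewards = np.array([checker_reward(g, integrand) for g in candidates.values()])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    print(dict(zip(candidates, advantages)))         # only "sin(x)" gets a positive advantage
    # A policy-gradient update would then push the model toward the
    # positive-advantage samples -- no human-labeled answers needed.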
It's exciting to see approaches like RL and curriculum learning -- which I always felt were the way to go for real self-improvement ~7 years ago when I was training agents in robotics (OpenAI Gym days) -- finally being successfully applied to NLP/LLMs to give a big boost to small-model performance.

(LADDER is a sort of RL self-curriculum-learning approach.)
Off topic, but their site is lovely: https://tufalabs.ai/index.html
It feels like a gold rush for sure.
How much GPU would an RL setup like this need for tuning? Is this approach something someone could experiment with themselves, or is it more like thousands of USD in cloud costs and/or years of compute if done on a laptop GPU?
What’s the difference between this and what Wolfram Alpha has been doing?

https://www.wolfram.com/artificial-intelligence/
I would love to be able to use the actual model! If I'm understanding correctly, this makes small models as intelligent as much larger models like GPT-4o.
I had NotebookLM make a 15-minute podcast about it and listened to it while walking the dogs. It was a very interesting way of trying to understand a research paper!

You need a Google account to access it, unfortunately: https://notebooklm.google.com/notebook/fbaba495-d4f2-48a3-a3c2-09cb826b351b/audio
> We also introduce TTRL (Test-Time Reinforcement Learning), where we perform reinforcement learning on variants of test problems at inference time. TTRL enables Qwen2.5 7B Deepseek-R1 Distilled to achieve a state-of-the-art score of 90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1's performance.

That's incredible!
I’m kinda getting the sense this is still just prompt engineering in a loop.

> Persona-based prompting: We prompted the model to adopt different mathematical perspectives (e.g., "think like Euler focusing on series", "approach like Gauss looking for patterns").

I mean ... I guess that's scientific?

Besides that, how can the model learn at test time (at inference)? It's stateless; it doesn't incorporate the last prompt into the model.