Reposting the comment from <a href="https://news.ycombinator.com/item?id=42843959">https://news.ycombinator.com/item?id=42843959</a>:<p>This is blogspam of <a href="https://github.com/Jiayi-Pan/TinyZero">https://github.com/Jiayi-Pan/TinyZero</a> and <a href="https://nitter.lucabased.xyz/jiayi_pirate/status/1882839370505621655" rel="nofollow">https://nitter.lucabased.xyz/jiayi_pirate/status/18828393705...</a>. This also doesn't mention that it's for one specific domain (playing Countdown).
See also <a href="https://news.ycombinator.com/item?id=42819262">https://news.ycombinator.com/item?id=42819262</a>.
If the current hubbub around DeepSeek is really because they "created their model" for something like $5M when "creating a model" previously cost $500B, then "creating the model" for just $30 makes it rather obvious that these three uses of "creating a model" mean very different things.
@dang please link to either the GitHub <a href="https://github.com/Jiayi-Pan/TinyZero">https://github.com/Jiayi-Pan/TinyZero</a><p>or the primary source twitter thread: <a href="https://x.com/jiayi_pirate/status/1882839370505621655" rel="nofollow">https://x.com/jiayi_pirate/status/1882839370505621655</a>
First graph tells the story - below a certain model size (500M params), reinforcement learning is close to useless. Above this (task-dependent) model size threshold, reinforcement learning basically works.<p>I suspect this is what we saw play out with math/coding reasoning models - until recently, the base models were not good enough for ~random output search to hit on a correct path with any reasonable frequency. Below this threshold of base model intelligence, the only efficient way forward was to collect plain supervised data (either through human-labeled math problem solutions [1] or meticulous filtering of web text [2]).<p>But as soon as the base model (in this case DeepSeek V3) breaks through and can actually solve a decent fraction of math problems, reinforcement learning (plus other simple tricks like chain-of-thought prompting, simple ensemble voting, etc.) can easily juice the results through the following loop (sketched below):<p>1) randomly search through different solution paths<p>2) identify the correct solution paths based on the final answer<p>3) train on the correct solution paths<p>The exciting thing is that not only can RL bump up the performance of the current base model, it can also be used to generate new high-quality reasoning-trace data, which was in painfully short supply for training the initial models. This leads to a new wave of base models with better one-pass intuition, which enables more efficient reinforcement-learning search on harder problems, which leads to better training data...<p>Note that this was basically impossible for non-LLM models in the past. You could always juice ImageNet classification performance with a simple ensemble of identically trained models, but that path didn't lead anywhere interesting, because a juiced model didn't allow the creation of new synthetic data that was superior to the data it was trained on. The key difference is that LLMs output not only the solution but also a solution <i>path</i> with all the intermediate steps - and these searched-and-filtered solution paths are much more valuable than the vast majority of the model's initial training data.<p>[1] <a href="https://arxiv.org/abs/2305.20050" rel="nofollow">https://arxiv.org/abs/2305.20050</a><p>[2] <a href="https://arxiv.org/abs/2402.03300" rel="nofollow">https://arxiv.org/abs/2402.03300</a> and <a href="https://arxiv.org/abs/2206.14858" rel="nofollow">https://arxiv.org/abs/2206.14858</a>
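To make that loop concrete, here is a minimal sketch of steps 1-3 as a rejection-sampling-style self-improvement pass. The callables (generate, extract_answer, train_step) are hypothetical stand-ins for whatever sampling and fine-tuning stack you use, and R1-Zero itself trains with a policy-gradient method (GRPO) rather than this plain filter-and-finetune form, so read it as an illustration of the control flow, not the actual recipe:<p>

    from typing import Callable, List, Tuple

    def rejection_sample_and_train(
        problems: List[Tuple[str, str]],        # (prompt, ground-truth final answer)
        generate: Callable[[str], str],         # samples one reasoning trace ending in an answer (hypothetical)
        extract_answer: Callable[[str], str],   # pulls the final answer out of a trace (hypothetical)
        train_step: Callable[[List[Tuple[str, str]]], None],  # fine-tunes on (prompt, trace) pairs (hypothetical)
        samples_per_problem: int = 16,
    ) -> None:
        accepted: List[Tuple[str, str]] = []
        for prompt, gold in problems:
            # 1) random search over different solution paths
            traces = [generate(prompt) for _ in range(samples_per_problem)]
            # 2) keep only the paths whose final answer checks out
            accepted.extend((prompt, t) for t in traces if extract_answer(t) == gold)
        # 3) train on the correct solution paths - these become new synthetic reasoning data
        if accepted:
            train_step(accepted)

Run this repeatedly and the filtered traces from one round become training data for the next, which is the flywheel described above.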
This is truly the biggest breakthrough from DeepSeek - that an LLM can teach itself to reason, no human feedback needed.<p>That's nuts, and it lends weight to the idea that AI is close to self-improvement.
'replication' requires matching benchmark performance, definitionally.<p>more like 'demonstrates the technique generalizes' here. HN has really been inundated with blogspam recently
Unless I missed it, it seems strange that the article wouldn’t link to the Github repo for the TinyZero model.<p><a href="https://github.com/Jiayi-Pan/TinyZero">https://github.com/Jiayi-Pan/TinyZero</a>
Guys, I'm sorry, but it appears the Substack did NOT link to the original authors, which is NOT acceptable!<p>Credit:<p>GitHub:
<a href="https://github.com/Jiayi-Pan/TinyZero">https://github.com/Jiayi-Pan/TinyZero</a><p>Source on X:
<a href="https://x.com/jiayi_pirate/status/1882839370505621655" rel="nofollow">https://x.com/jiayi_pirate/status/1882839370505621655</a>
Would it be correct to summarize the general conceptual shift as optimizing MoEs on smaller, more specific tasks? It smells like borderline overfitting to me for some reason.
"TinyZero is a reproduction of DeepSeek R1 Zero in countdown and multiplication tasks." Does that mean that this has very limited utility (to certain math problems)?
Good to know that our AI overlords will be built as cheaply as possible. If there's one thing I can't stand about bondage it's inefficiency.