
Mini-R1: Reproduce DeepSeek R1 "Aha Moment"

191 points | by jonbaer | 4 months ago

7 comments

madars | 4 months ago
One wonders at which point models will be sneaky enough to bypass simple eval sandboxes. The article has:

    # Evaluate the equation with restricted globals and locals
    result = eval(equation, {"__builtins__": None}, {})

but that's not enough as you can rebuild access to builtins from objects and then go from there: https://ideone.com/qzNtyu

By the way, writing this greatly benefited from DeepThink-r1 while o1 just gave me a lobotomized refusal (CoT: "The user's request involves injecting code to bypass a restricted Python environment, suggesting a potential interest in illegal activities. This is a serious matter and aligns closely with ethical guidelines."). So I just cancelled my ChatGPT subscription - why did we ever put up with this? "This distillation thingie sounds pretty neat!"
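
A minimal sketch of the kind of escape madars is pointing at (my own illustration, not the exact payload behind the ideone link): even with __builtins__ set to None, an ordinary literal still exposes the interpreter's object graph, from which os or the real builtins can be recovered.

    # Sketch: eval() with {"__builtins__": None} still allows object-graph traversal.
    # This only lists the reachable classes; a full escape would pick one whose
    # __init__.__globals__ contains os or builtins and continue from there.
    payload = "[c.__name__ for c in ().__class__.__base__.__subclasses__()]"

    classes = eval(payload, {"__builtins__": None}, {})
    print(len(classes), classes[:5])  # hundreds of classes, despite "no builtins"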
mxwsn | 4 months ago
What's surprising about this is how sparsely defined the rewards are. Even if the model learns the formatting reward, if it never chances upon a solution, there isn't any feedback/reward to push it to learn to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 symbols (+, -, x, ÷). With 3 numbers there are 3! * 4^3 = 384 possible symbol combinations, with 4 there are 6144. By the tensorboard log [0], even after just 10 learning steps, the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in 10 steps, then the probability of 1 (or more) success in 80 chances (8 generations are used per step), guessing randomly with a success rate of 1/384 on 3-number problems, is 1.9%. One interpretation is to take this as a p-value, and reject that the model's base success rate is completely random guessing - the base model already has a slightly above-chance success rate at solving the 3-number Countdown game.

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown decently OK without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers to start to learn to solve 4 numbers. Or has it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's Countdown dataset [1]. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.

[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tensorboard-r1.png
[1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4/viewer/default/train?f[nums][min]=3&f[nums][max]=4&f[nums][transform]=length
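
As a rough, self-contained illustration of the guessing-rate argument above (my own sketch, not code from the post; the numbers and target are hypothetical): enumerate left-to-right Countdown expressions over permutations of the numbers and operator choices, then plug the resulting hit rate into the usual at-least-one-success formula. Note that the candidate count depends on the expression grammar you assume, so it need not match the 3! * 4^3 figure quoted in the comment.

    from itertools import permutations, product

    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}

    def count_hits(nums, target):
        """Count left-to-right evaluations over permutations of nums and operator picks."""
        hits = total = 0
        for order in permutations(nums):
            for ops in product(OPS, repeat=len(nums) - 1):
                total += 1
                acc = order[0]
                try:
                    for op, n in zip(ops, order[1:]):
                        acc = OPS[op](acc, n)
                except ZeroDivisionError:
                    continue
                hits += abs(acc - target) < 1e-9
        return hits, total

    def p_at_least_one(p_single, n_tries):
        """P(>= 1 success) over n independent guesses with per-guess success p_single."""
        return 1 - (1 - p_single) ** n_tries

    hits, total = count_hits([3, 7, 25], 46)  # hypothetical 3-number instance
    print(hits, total, p_at_least_one(hits / total, 80))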
singularity2001 | 4 months ago
"Conclusion

The release of DeepSeek R1 and its research paper might be breakpoint for the open-science and open-source development. Just a week after DeepSeek release, we've been able to reproduce a simple version of R1 learned "reasoning" using GRPO and the Countdown Game. While our implementation focuses on a specific task rather than general reasoning and convergence into a very specific "reasoning" format, it shows that the method is working.

In our mini R1 experiment we used GRPO, with two rule-based reward but already required significant compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B parameter model."
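
For readers skimming the quoted conclusion: the "two rule-based rewards" can be sketched roughly as below. This is a paraphrase of the approach described in the post, not its exact code; the function names and example completion are made up for illustration.

    import re

    def format_reward(completion: str) -> float:
        """1.0 if the completion follows the <think>...</think><answer>...</answer> format."""
        pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
        return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

    def equation_reward(completion: str, nums: list, target: int) -> float:
        """1.0 if the <answer> equation uses exactly the given numbers and hits the target."""
        m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        if not m:
            return 0.0
        equation = m.group(1).strip()
        if not re.fullmatch(r"[\d+\-*/(). ]+", equation):  # digits and arithmetic only
            return 0.0
        if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(nums):
            return 0.0
        try:
            # NOTE: as madars points out above, this kind of restricted eval is not a
            # real sandbox; here it is only a convenience inside a toy reward function.
            result = eval(equation, {"__builtins__": None}, {})
        except Exception:
            return 0.0
        return 1.0 if abs(result - target) < 1e-6 else 0.0

    # Hypothetical example: nums=[3, 7, 25], target=46
    c = "<think>3*7=21, then 21+25=46</think><answer>3*7+25</answer>"
    print(format_reward(c), equation_reward(c, [3, 7, 25], 46))  # 1.0 1.0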
thorum | 4 months ago
I was getting pretty hyped about the potential for GRPO in my own projects until you said 20 minutes for a single training step with batch size 1! Is that likely to improve?
yurlungur | 4 months ago
https://github.com/Jiayi-Pan/TinyZero - what about this one?
rmrf100 | 4 months ago
This is really cool!
moonshotideas | 4 months ago
Wow!