What's surprising about this is how sparse the rewards are. Even if the model learns the formatting reward, if it never chances upon a correct solution, there is no feedback/reward to push it to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 operators (+, -, x, ÷). Combining 3 numbers takes 2 operators, so there are 3! * 4^2 = 96 possible order/operator combinations (ignoring parenthesization); with 4 numbers there are 4! * 4^3 = 1536. By the tensorboard log [0], even after just 10 learning steps, the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in those 10 steps, that rate amounts to roughly 8 successes across 80 generations (8 generations are used per step), and the probability of 8 or more successes under random guessing at 1/96 on 3-number problems is on the order of 10^-6. One interpretation is to take this as a p-value and reject that the model's base success rate is pure random guessing - the base model already has an above-chance success rate at solving the 3-number Countdown game.

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown reasonably well without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers into learning to solve 4 numbers. Or does it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's Countdown dataset [1] - exactly what we'd see if the model solved the 3-number problems consistently and the 4-number ones rarely. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.
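To sanity-check the arithmetic above, here's a minimal sketch (mine, not from Phil's post; the instance [3, 7, 50] -> 54 is a made-up example) that brute-forces every left-to-right chain for a 3-number problem and computes the exact binomial tail for 8 or more lucky guesses in 80 tries:

```python
from itertools import permutations, product
from math import comb

# The four Countdown operators; division guarded against zero.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def chain_values(nums):
    """Evaluate every left-to-right chain over every ordering of nums."""
    for perm in permutations(nums):
        for ops in product(OPS, repeat=len(nums) - 1):  # n numbers -> n-1 operators
            val = perm[0]
            for op, x in zip(ops, perm[1:]):
                val = OPS[op](val, x)
                if val is None:  # division by zero; abandon this chain
                    break
            else:
                yield val

nums, target = [3, 7, 50], 54  # hypothetical example instance
values = list(chain_values(nums))
print(len(values), "candidate chains")              # 3! * 4^2 = 96 for this instance
print(sum(v == target for v in values), "of them hit the target")

# P(8 or more successes in 80 generations at p = 1/96), exact binomial tail.
n, p = 80, 1 / 96
tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(8, n + 1))
print(f"p-value ~= {tail:.1e}")                     # on the order of 1e-6
```

Equivalently, scipy.stats.binom.sf(7, 80, 1/96) gives the same tail probability without the explicit sum.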
[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tensorboard-r1.png

[1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4/viewer/default/train?f[nums][min]=3&f[nums][max]=4&f[nums][transform]=length