What's surprising about this is how sparse the rewards are. Even if the model learns the formatting reward, if it never chances upon a correct solution, there is no feedback/reward to push it to solve the game more often.

So what are the chances of randomly guessing a solution?

The toy Countdown dataset here has 3 to 4 numbers, which are combined with 4 operators (+, -, x, ÷). Combining 3 numbers takes 2 operators, so there are 3! * 4^2 = 96 possible order/operator combinations (ignoring parenthesization); with 4 numbers there are 4! * 4^3 = 1536. By the tensorboard log [0], even after just 10 learning steps, the model already has a success rate just below 10%. If we make the simplifying assumption that the model hasn't learned anything in those 10 steps, that rate amounts to roughly 8 successes across 80 generations (8 generations are used per step), and the probability of 8 or more successes under random guessing at 1/96 on 3-number problems is on the order of 10^-6. One interpretation is to take this as a p-value and reject that the model's base success rate is pure random guessing - the base model already has an above-chance success rate at solving the 3-number Countdown game.

This aligns with my intuition - I suspect that with proper prompting, LLMs should be able to solve Countdown reasonably well without any training. Though maybe not a 3B model?

The model likely "parlays" its successes on 3 numbers into learning to solve 4 numbers. Or does it? The final learned ~50% success rate matches the frequency of 4-number problems in Jiayi Pan's Countdown dataset [1] - exactly what we'd see if the model solved the 3-number problems consistently and the 4-number ones rarely. Phil does provide examples of successful 4-number solutions, but maybe the model hasn't become consistent at 4 numbers yet.
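To sanity-check the arithmetic above, here's a minimal sketch (mine, not from Phil's post; the instance [3, 7, 50] -> 54 is a made-up example) that brute-forces every left-to-right chain for a 3-number problem and computes the exact binomial tail for 8 or more lucky guesses in 80 tries:

```python
from itertools import permutations, product
from math import comb

# The four Countdown operators; division guarded against zero.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def chain_values(nums):
    """Evaluate every left-to-right chain over every ordering of nums."""
    for perm in permutations(nums):
        for ops in product(OPS, repeat=len(nums) - 1):  # n numbers -> n-1 operators
            val = perm[0]
            for op, x in zip(ops, perm[1:]):
                val = OPS[op](val, x)
                if val is None:  # division by zero; abandon this chain
                    break
            else:
                yield val

nums, target = [3, 7, 50], 54  # hypothetical example instance
values = list(chain_values(nums))
print(len(values), "candidate chains")              # 3! * 4^2 = 96 for this instance
print(sum(v == target for v in values), "of them hit the target")

# P(8 or more successes in 80 generations at p = 1/96), exact binomial tail.
n, p = 80, 1 / 96
tail = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(8, n + 1))
print(f"p-value ~= {tail:.1e}")                     # on the order of 1e-6
```

Equivalently, scipy.stats.binom.sf(7, 80, 1/96) gives the same tail probability without the explicit sum.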
[0]: https://www.philschmid.de/static/blog/mini-deepseek-r1/tensorboard-r1.png

[1]: https://huggingface.co/datasets/Jiayi-Pan/Countdown-Tasks-3to4/viewer/default/train?f[nums][min]=3&f[nums][max]=4&f[nums][transform]=length