>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.<p>I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make the reasoning illegible?)<p>>the 32B model’s response lengths collapsing, especially after reaching peak performance.<p>I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just outputting answers directly), how well can it do?
These puzzles probably have more in common with "Zebra puzzles" (e.g. <a href="https://www.zebrapuzzles.com/" rel="nofollow">https://www.zebrapuzzles.com/</a>) than with Cluedo (Clue in the US) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous amount of text, trying out possibilities, backtracking, and sometimes getting confused.<p>From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.<p>But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem and write out Python code to call a CSP solver.
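Something along these lines, using the python-constraint package on a toy three-house Zebra-style puzzle (my own illustrative clues, not one of that site's puzzles):

    # Encode the puzzle as a CSP instead of reasoning through it in text.
    from constraint import Problem, AllDifferentConstraint

    problem = Problem()
    houses = [1, 2, 3]
    colors = ["red", "green", "blue"]
    pets = ["dog", "cat", "fish"]

    # Each color and each pet is assigned to exactly one house.
    problem.addVariables(colors, houses)
    problem.addVariables(pets, houses)
    problem.addConstraint(AllDifferentConstraint(), colors)
    problem.addConstraint(AllDifferentConstraint(), pets)

    # Example clues: "the dog lives in the red house",
    # "the green house is immediately left of the house with the fish".
    problem.addConstraint(lambda red, dog: red == dog, ("red", "dog"))
    problem.addConstraint(lambda green, fish: green + 1 == fish, ("green", "fish"))

    for solution in problem.getSolutions():
        print(solution)

Real Zebra puzzles just add more houses and more clue constraints; the solver does the backtracking that the reasoning models do so verbosely in text.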
I couldn't quickly find it by searching your GitHub repo, but which layers did you end up targeting for training? Would be interesting to see an ablation on targeting different sets of layers (train only the attention layers, freeze the first 30% of the layers and train the remaining 70%, etc.).
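For concreteness, the sort of thing I have in mind, assuming a Llama/Qwen-style parameter layout in Hugging Face Transformers (names like "model.layers.12.self_attn.q_proj.weight") and a stand-in checkpoint name, not their actual setup:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

    num_layers = model.config.num_hidden_layers
    freeze_below = int(0.3 * num_layers)  # freeze the first 30% of transformer blocks

    for name, param in model.named_parameters():
        if name.startswith("model.layers."):
            layer_idx = int(name.split(".")[2])
            param.requires_grad = layer_idx >= freeze_below
        else:
            # embeddings, final norm, lm_head: frozen in this particular ablation
            param.requires_grad = False

    # The attention-only variant would instead be:
    #     param.requires_grad = ".self_attn." in name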
Unless I’m missing something, this isn’t online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.<p>The results of this paper suggest that doing what they did, but online, could give better results:<p><a href="https://arxiv.org/abs/2402.04792" rel="nofollow">https://arxiv.org/abs/2402.04792</a>
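To be explicit about the distinction I'm drawing, schematically (sample_rollouts and grpo_step are hypothetical placeholders, not their actual training code):

    def sample_rollouts(policy, prompts):
        # Placeholder: generate a group of completions per prompt with `policy`.
        return [(p, policy(p)) for p in prompts]

    def grpo_step(policy, rollouts):
        # Placeholder: one GRPO gradient update on the given rollouts.
        pass

    def offline_grpo(policy, prompts, steps):
        rollouts = sample_rollouts(policy, prompts)      # one collection pass, reused for every update
        for _ in range(steps):
            grpo_step(policy, rollouts)

    def online_grpo(policy, prompts, steps):
        for _ in range(steps):
            rollouts = sample_rollouts(policy, prompts)  # re-sampled from the current policy each step
            grpo_step(policy, rollouts)

In the online loop the training data stays on-policy, whereas the offline loop keeps training on rollouts that get staler as the policy moves away from the one that generated them.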
This looks impressive, but I’m concerned: is it fair to “teach to the test” by fine-tuning the Qwen model with RL on the test task, while the other models in the comparison are not fine-tuned on it?
This is the same team that, a few months ago here on Hacker News, talked about how to do fine-tuning on large language models, and then made it closed source.
Wait, what's the difference between using GRPO and traditional supervised fine-tuning of Qwen on your provided dataset?<p>It would be super interesting to see which one is more data-efficient!
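As I understand it (happy to be corrected): SFT maximizes the likelihood of fixed reference answers, while GRPO samples several completions per prompt, scores them with a reward, and weights each rollout by its group-relative advantage. A minimal sketch of that advantage step, with made-up reward numbers rather than anything from the post:

    import torch

    # Two prompts, G=4 sampled completions each, reward 1.0 if the puzzle answer is correct.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])

    # Group-relative advantage: normalize rewards within each prompt's group.
    advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-8)

    # Each rollout's token log-probs get weighted by its advantage in the
    # policy-gradient loss, whereas SFT weights every reference token equally.

So GRPO only needs a reward signal (was the answer right or not), not worked reference solutions, which is part of why the data-efficiency comparison would be interesting.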