>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.<p>I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make the reasoning illegible?)<p>>the 32B model’s response lengths collapsing, especially after reaching peak performance.<p>I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just outputting answers directly), how well can it do?
These puzzles probably have more in common with "Zebra puzzles" (e.g. <a href="https://www.zebrapuzzles.com/" rel="nofollow">https://www.zebrapuzzles.com/</a>) than with Cluedo (Clue in the US) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous amount of text, trying out possibilities, backtracking, and sometimes getting confused.<p>From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.<p>But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem and write out Python code to call a CSP solver.
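Something along these lines, using the python-constraint package on a toy three-house Zebra-style puzzle (my own illustrative clues, not one of that site's puzzles):

    # Encode the puzzle as a CSP instead of reasoning through it in text.
    from constraint import Problem, AllDifferentConstraint

    problem = Problem()
    houses = [1, 2, 3]
    colors = ["red", "green", "blue"]
    pets = ["dog", "cat", "fish"]

    # Each color and each pet is assigned to exactly one house.
    problem.addVariables(colors, houses)
    problem.addVariables(pets, houses)
    problem.addConstraint(AllDifferentConstraint(), colors)
    problem.addConstraint(AllDifferentConstraint(), pets)

    # Example clues: "the dog lives in the red house",
    # "the green house is immediately left of the house with the fish".
    problem.addConstraint(lambda red, dog: red == dog, ("red", "dog"))
    problem.addConstraint(lambda green, fish: green + 1 == fish, ("green", "fish"))

    for solution in problem.getSolutions():
        print(solution)

Real Zebra puzzles just add more houses and more clue constraints; the solver does the backtracking that the reasoning models do so verbosely in text.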
I couldn't quickly find it by searching your GitHub repo, but which layers did you end up targeting for training? Would be interesting to see an ablation on targeting different sets of layers (train only the attention layers, freeze the first 30% of the layers and train the remaining 70%, etc.).
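For concreteness, the sort of thing I have in mind, assuming a Llama/Qwen-style parameter layout in Hugging Face Transformers (names like "model.layers.12.self_attn.q_proj.weight") and a stand-in checkpoint name, not their actual setup:

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

    num_layers = model.config.num_hidden_layers
    freeze_below = int(0.3 * num_layers)  # freeze the first 30% of transformer blocks

    for name, param in model.named_parameters():
        if name.startswith("model.layers."):
            layer_idx = int(name.split(".")[2])
            param.requires_grad = layer_idx >= freeze_below
        else:
            # embeddings, final norm, lm_head: frozen in this particular ablation
            param.requires_grad = False

    # The attention-only variant would instead be:
    #     param.requires_grad = ".self_attn." in name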
Unless I’m missing something, this isn’t online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.<p>The results of this paper suggest that doing what they did, but online, could give better results:<p><a href="https://arxiv.org/abs/2402.04792" rel="nofollow">https://arxiv.org/abs/2402.04792</a>
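To be explicit about the distinction I'm drawing, schematically (sample_rollouts and grpo_step are hypothetical placeholders, not their actual training code):

    def sample_rollouts(policy, prompts):
        # Placeholder: generate a group of completions per prompt with `policy`.
        return [(p, policy(p)) for p in prompts]

    def grpo_step(policy, rollouts):
        # Placeholder: one GRPO gradient update on the given rollouts.
        pass

    def offline_grpo(policy, prompts, steps):
        rollouts = sample_rollouts(policy, prompts)      # one collection pass, reused for every update
        for _ in range(steps):
            grpo_step(policy, rollouts)

    def online_grpo(policy, prompts, steps):
        for _ in range(steps):
            rollouts = sample_rollouts(policy, prompts)  # re-sampled from the current policy each step
            grpo_step(policy, rollouts)

In the online loop the training data stays on-policy, whereas the offline loop keeps training on rollouts that get staler as the policy moves away from the one that generated them.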
This looks impressive, but I’m concerned: is it fair to “teach to the test” by fine-tuning the Qwen model with RL on the test task, while the other models in the comparison are not fine-tuned on it?
This is the same team that, a few months ago here on Hacker News, talked about how to do fine-tuning on large language models, and then made it closed source.
Wait, what's the difference between using GRPO and traditional supervised fine-tuning of Qwen on your provided dataset?<p>It would be super interesting to see which one is more data-efficient!
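As I understand it (happy to be corrected): SFT maximizes the likelihood of fixed reference answers, while GRPO samples several completions per prompt, scores them with a reward, and weights each rollout by its group-relative advantage. A minimal sketch of that advantage step, with made-up reward numbers rather than anything from the post:

    import torch

    # Two prompts, G=4 sampled completions each, reward 1.0 if the puzzle answer is correct.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                            [0.0, 0.0, 1.0, 0.0]])

    # Group-relative advantage: normalize rewards within each prompt's group.
    advantages = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-8)

    # Each rollout's token log-probs get weighted by its advantage in the
    # policy-gradient loss, whereas SFT weights every reference token equally.

So GRPO only needs a reward signal (was the answer right or not), not worked reference solutions, which is part of why the data-efficiency comparison would be interesting.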