
Using GRPO to Beat o1, o3-mini and R1 at “Temporal Clue”

199 points | by kcorbitt | 2 months ago

11 comments

Imnimo 2 months ago

> To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.

I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)

> the 32B model's response lengths collapsing, especially after reaching peak performance.

I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just output answers directly), how well can it do?
jmmcd 2 months ago

These puzzles probably have more in common with "Zebra puzzles" (e.g. https://www.zebrapuzzles.com/) than with Cluedo (US: Clue) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous batch of text, trying out possibilities, backtracking, and sometimes getting confused.

From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.

But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem, and write out Python code to call a CSP solver.
layer8 2 months ago

GRPO = Group Relative Policy Optimization

https://arxiv.org/abs/2402.03300
Tostino 2 months ago

I couldn't quickly find it by searching your GitHub, but which layers did you end up targeting for training? It would be interesting to see an ablation on targeting different sets of layers (train only the attention layers, freeze the first 30% of the layers and train the remaining 70%, etc.).
kcorbitt 2 months ago

One of the authors here. Happy to answer any questions about our methods/results!
kiratp 2 months ago

Unless I'm missing something, this isn't online RL. They are collecting outputs in one pass and then doing a separate offline GRPO training run on those.

The results of this paper suggest that doing what they did, but online, could return better results:

https://arxiv.org/abs/2402.04792
bionhoward 2 months ago

This looks impressive, but I'm concerned: is it fair to "teach to the test" by fine-tuning the Qwen model with RL on the test task, while the other models in the comparison are not fine-tuned on it?
Liwink 2 months ago
Can you please share the training cost?
behnamoh 2 months ago

This is the same team that, a few months ago here on Hacker News, talked about how to do fine-tuning on large language models, and then made it closed source.
machiaweliczny 2 months ago

It would be great if some details were given about how exactly the model is penalized for going off-track.
randomcatuser 2 months ago

Wait, what's the difference between using GRPO and traditional fine-tuning of Qwen on your provided dataset?

It would be super interesting to see which one is more data-efficient!