1 point by delifue, about 2 months ago

1 comment

delifue, about 2 months ago

From: https://x.com/zzlccc/status/1903162768083259703

DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning

The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO

Understanding R1-Zero-Like Training: A Critical Perspective [pdf]
