Long-Context GRPO

60 points, by veryluckyxyz, 3 months ago

3 comments

sidkshatriya, 3 months ago
This is what I understood from the blog post (please correct me if I am wrong):

Unsloth allows you to give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as the weight "delta".

Typically one would do SFT with the training data. But Unsloth also allows you to do RL (Reinforcement Learning), specifically GRPO, on the model + training data you give it! The output of the GRPO here is again in the form of LoRA/QLoRA weights.

You have found a way to reduce the memory requirements for GRPO.

Question: How does one decide whether the training data will be used for SFT (Supervised Fine-Tuning) or GRPO? When will you get better results with SFT and when with GRPO?
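[Editor's note: for concreteness, here is a minimal sketch of the LoRA idea the comment describes, in plain PyTorch. The class and parameter names are illustrative, not Unsloth's API: the base weight stays frozen, and only the low-rank factors A and B, whose product is the weight "delta", receive gradients.]

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base linear layer plus a trainable low-rank delta B @ A.
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # base weights stay constant
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
            self.scale = alpha / r

        def forward(self, x):
            # Frozen path plus scaled low-rank "delta" path; only A and B train.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512))
    out = layer(torch.randn(4, 512))  # only layer.A and layer.B require grad

Whether the training signal comes from SFT or from GRPO, the artifact produced is just these small A/B matrices rather than a full copy of the model weights.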
yorwba, 3 months ago
> We also found interestingly that:

    torch.exp(q - q.detach()) * advantages.unsqueeze(1)

> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.

The autograd engine is propagating gradients correctly, but the question is: which gradients?

You could encapsulate this as a function

    f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)

then have f_a(a, b) be the derivative of that with respect to a, and substitute in q for both variables to get f_a(q, q).

But if you substitute to get f(q, q) first and then differentiate with respect to q, you don't get f_a(q, q), but instead f_a(q, q) + f_b(q, q), which in this case would be 0. The ordering of variable substitution and differentiation cannot be exchanged freely.

detach() is a way to say "we want to differentiate the expression first, treating this as a constant, and then substitute with this variable afterwards."
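[Editor's note: a small self-contained check of the point above. The q and advantages tensors here are made up for illustration; this is a sketch, not Unsloth's actual GRPO code. Both losses have the same value, but only the detached form propagates a nonzero gradient.]

    import torch

    q = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)
    advantages = torch.tensor([1.0, 2.0, 3.0])

    # Substitute first, then differentiate:
    # d/dq exp(q - q) = f_a(q, q) + f_b(q, q) = 0, so the gradient vanishes.
    loss_naive = (torch.exp(q - q) * advantages).sum()
    loss_naive.backward()
    print(q.grad)  # tensor([0., 0., 0.])

    q.grad = None

    # Differentiate first, treating the detached copy as a constant:
    # the gradient is f_a(q, q) = exp(0) * advantages = advantages.
    loss_detached = (torch.exp(q - q.detach()) * advantages).sum()
    loss_detached.backward()
    print(q.grad)  # tensor([1., 2., 3.])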
danielhanchen, 3 months ago
Oh, thanks for posting! If anyone has any questions, feel free to ask!