
Long-Context GRPO

60 points by veryluckyxyz 3 months ago

3 comments

sidkshatriya 3 months ago

This is what I understood from the blog post (please correct me if I am wrong):

Unsloth allows you to give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as the weight "delta".

Typically one would do SFT with the training data. But Unsloth also allows you to do RL (reinforcement learning), specifically GRPO, on the model + training data you give it! The output of the GRPO here is again in the form of LoRA/QLoRA weights.

You have found a way to reduce the memory requirements for GRPO.

Question: How does one decide whether the training data will be used for SFT (supervised fine-tuning) or GRPO? When will you get better results with SFT and when with GRPO?
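(A minimal PyTorch sketch of the "frozen weights plus low-rank delta" idea described above; this is my own illustration, not Unsloth's code, and the LoRALinear name, rank, and alpha values are made up for the example.)

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Sketch of LoRA: the base weight W is frozen; only the low-rank
        factors A and B (the weight "delta") receive gradient updates."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                     # keep base weights constant
            in_f, out_f = base.in_features, base.out_features
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # trainable
            self.B = nn.Parameter(torch.zeros(out_f, rank))         # trainable
            self.scale = alpha / rank

        def forward(self, x):
            # y = x W^T + scale * x (B A)^T ; only B A is the learned delta
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    # Whether the objective is SFT or GRPO, only A and B are optimized here.
    layer = LoRALinear(nn.Linear(512, 512), rank=8)
    opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-4)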
yorwba 3 months ago

> We also found interestingly that:
>
>     torch.exp(q - q.detach()) * advantages.unsqueeze(1)
>
> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.

The autograd engine is propagating gradients correctly, but the question is, which gradients?

You could encapsulate this as a function

    f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)

then have f_a(a, b) be the derivative of that with respect to a, and substitute in q for both variables to get f_a(q, q).

But if you substitute to get f(q, q) first and then differentiate with respect to q, you don't get f_a(q, q), but instead f_a(q, q) + f_b(q, q), which in this case would be 0. The ordering of variable substitution and differentiation cannot be exchanged freely.

detach() is a way to say "we want to differentiate the expression first, treating this as a constant, and then substitute with this variable afterwards."
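(A tiny PyTorch check of that distinction, simplified to scalars with a made-up advantage value; this is an illustration of the detach() point above, not the actual GRPO loss.)

    import torch

    advantages = torch.tensor([2.0])
    q = torch.tensor([1.5], requires_grad=True)

    # f(q, q): substitute first, then differentiate -> the two branches cancel, gradient 0
    loss_no_detach = (torch.exp(q - q) * advantages).sum()
    print(torch.autograd.grad(loss_no_detach, q)[0])   # tensor([0.])

    # f_a(q, q): detach() treats the second argument as a constant while differentiating,
    # so the gradient is exp(0) * advantages
    loss_detach = (torch.exp(q - q.detach()) * advantages).sum()
    print(torch.autograd.grad(loss_detach, q)[0])      # tensor([2.])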
danielhanchen 3 months ago
Oh thanks for posting! If anyone has any questions about stuff, feel free to ask!