This is what I understood from the blog post (please correct me if I am wrong):<p>Unsloth lets you give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as a weight "delta" (rough sketch below).<p>Typically one would do SFT with the training data, but Unsloth also lets you do RL (reinforcement learning), specifically GRPO, on the model and training data you give it. The output of GRPO here is again in the form of LoRA/QLoRA weights.<p>You have found a way to reduce the memory requirements for GRPO.
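To check my understanding of the LoRA part, this is roughly the picture I have in mind (a toy PyTorch sketch with made-up names, rank and scaling, not Unsloth's actual implementation):<p><pre><code>  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      # wrap a frozen linear layer with a trainable low-rank delta B @ A
      def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad = False  # base weights stay constant
          self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
          self.B = nn.Parameter(torch.zeros(base.out_features, r))
          self.scale = alpha / r

      def forward(self, x):
          # effective weight is W + scale * (B @ A); only A and B get gradients
          return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
</code></pre>
Only A and B (the "delta") are trained and saved, whether the training signal comes from SFT or from GRPO.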
Question:<p>How does one decide whether to use SFT (supervised fine-tuning) or GRPO for a given set of training data? When will you get better results with SFT, and when with GRPO?
> We also found interestingly that:<p><pre><code> torch.exp(q - q.detach()) * advantages.unsqueeze(1)
</code></pre>
> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.<p>The autograd engine is propagating gradients correctly, but the question is, which gradients?<p>You could encapsulate this as a function:<p><pre><code> f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)
</code></pre>
then have <i>f_a(a, b)</i> be the derivative of that with respect to <i>a</i>, and substitute in <i>q</i> for both variables to get <i>f_a(q, q)</i>.<p>But if you substitute to get <i>f(q, q)</i> first and then differentiate with respect to <i>q</i>, you don't get <i>f_a(q, q)</i>, but instead <i>f_a(q, q) + f_b(q, q)</i>, which in this case is 0 (here <i>f_b(q, q) = -f_a(q, q)</i>, since <i>b</i> enters only through <i>a - b</i>). The ordering of variable substitution and differentiation cannot be exchanged freely.<p><i>detach()</i> is a way to say "differentiate the expression first, treating this argument as a constant, and then substitute the variable back in afterwards."
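To make this concrete, here is a toy check (made-up shapes and values, not the actual GRPO loss): the exponential factor evaluates to 1 in the forward pass either way, but only the detach() version keeps a gradient with respect to <i>q</i>.<p><pre><code>  import torch

  advantages = torch.tensor([2.0, -1.0])     # toy per-sequence advantages
  q = torch.randn(2, 3, requires_grad=True)  # toy per-token log-probs

  # f_a(q, q): the second copy of q is treated as a constant
  y1 = (torch.exp(q - q.detach()) * advantages.unsqueeze(1)).sum()
  # f(q, q): both copies depend on q, so the exponent is identically 0
  y2 = (torch.exp(q - q) * advantages.unsqueeze(1)).sum()

  g1, = torch.autograd.grad(y1, q)
  g2, = torch.autograd.grad(y2, q)

  print(g1)  # each row equals its advantage: the gradient flows
  print(g2)  # all zeros: the two paths through q cancel out
</code></pre>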