Long-Context GRPO

60 points, by veryluckyxyz, 3 months ago

3 comments

sidkshatriya, 3 months ago
This is what I understood from the blog post (please correct me if I am wrong):

Unsloth allows you to give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as the weight "delta".

Typically one would do SFT with the training data. But Unsloth also allows you to do RL (Reinforcement Learning), specifically GRPO, on the model + training data you give it! The output of the GRPO here is again in the form of LoRA/QLoRA weights.

You have found a way to reduce the memory requirements for GRPO.

Question: How does one decide whether the training data will be used for SFT (Supervised Fine-Tuning) or GRPO? When will you get better results with SFT and when with GRPO?
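[Editor's note: for concreteness, here is a minimal sketch of the LoRA idea the comment describes, in plain PyTorch. The class and parameter names are illustrative, not Unsloth's API: the base weight stays frozen, and only the low-rank factors A and B, whose product is the weight "delta", receive gradients.]

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base linear layer plus a trainable low-rank delta B @ A.
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # base weights stay constant
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
            self.scale = alpha / r

        def forward(self, x):
            # Frozen path plus scaled low-rank "delta" path; only A and B train.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512))
    out = layer(torch.randn(4, 512))  # only layer.A and layer.B require grad

Whether the training signal comes from SFT or from GRPO, the artifact produced is just these small A/B matrices rather than a full copy of the model weights.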
yorwba, 3 months ago
> We also found interestingly that:

    torch.exp(q - q.detach()) * advantages.unsqueeze(1)

> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.

The autograd engine is propagating gradients correctly, but the question is: which gradients?

You could encapsulate this as a function

    f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)

then have f_a(a, b) be the derivative of that with respect to a, and substitute in q for both variables to get f_a(q, q).

But if you substitute to get f(q, q) first and then differentiate with respect to q, you don't get f_a(q, q), but instead f_a(q, q) + f_b(q, q), which in this case would be 0. The ordering of variable substitution and differentiation cannot be exchanged freely.

detach() is a way to say "we want to differentiate the expression first, treating this as a constant, and then substitute with this variable afterwards."
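[Editor's note: a small self-contained check of the point above. The q and advantages tensors here are made up for illustration; this is a sketch, not Unsloth's actual GRPO code. Both losses have the same value, but only the detached form propagates a nonzero gradient.]

    import torch

    q = torch.tensor([0.5, -1.2, 2.0], requires_grad=True)
    advantages = torch.tensor([1.0, 2.0, 3.0])

    # Substitute first, then differentiate:
    # d/dq exp(q - q) = f_a(q, q) + f_b(q, q) = 0, so the gradient vanishes.
    loss_naive = (torch.exp(q - q) * advantages).sum()
    loss_naive.backward()
    print(q.grad)  # tensor([0., 0., 0.])

    q.grad = None

    # Differentiate first, treating the detached copy as a constant:
    # the gradient is f_a(q, q) = exp(0) * advantages = advantages.
    loss_detached = (torch.exp(q - q.detach()) * advantages).sum()
    loss_detached.backward()
    print(q.grad)  # tensor([1., 2., 3.])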
danielhanchen, 3 months ago
Oh, thanks for posting! If anyone has any questions, feel free to ask!