> We also found, interestingly, that:<p><pre><code> torch.exp(q - q.detach()) * advantages.unsqueeze(1)
</code></pre>
> is used, which should evaluate to 1, right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.<p>The autograd engine is propagating gradients correctly, but the question is, which gradients?<p>You could encapsulate this as a function<p><pre><code> f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)
</code></pre>
then have <i>f_a(a, b)</i> be the derivative of that with respect to <i>a</i>, and substitute in <i>q</i> for both variables to get <i>f_a(q, q)</i>.<p>But if you substitute to get <i>f(q, q)</i> first and then differentiate with respect to <i>q</i>, you don't get <i>f_a(q, q)</i>, but instead <i>f_a(q, q) + f_b(q, q)</i>, which in this case is 0: here <i>f_a(q, q) = advantages</i> while <i>f_b(q, q) = -advantages</i>, so the two terms cancel. The ordering of variable substitution and differentiation cannot be exchanged freely.<p><i>detach()</i> is a way to say "we want to differentiate the expression first, treating this argument as a constant, and then substitute the variable's value back in afterwards."
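<p>Here is a minimal, self-contained sketch of the difference; the tensor shapes for <i>q</i> and <i>advantages</i> are made up purely for illustration, not taken from the code above:<p><pre><code> import torch

 # stand-in tensors: rows are samples, columns are tokens
 q = torch.randn(4, 3, requires_grad=True)   # per-token log-probs
 advantages = torch.randn(4)                 # per-sample advantages

 # with detach(): the exp term evaluates to 1 everywhere, but the
 # gradient w.r.t. q is f_a(q, q), i.e. the advantages broadcast per token
 loss_detached = (torch.exp(q - q.detach()) * advantages.unsqueeze(1)).sum()
 grad_detached, = torch.autograd.grad(loss_detached, q)

 # without detach(): same forward value, but autograd differentiates through
 # both occurrences of q and returns f_a(q, q) + f_b(q, q) = 0
 loss_plain = (torch.exp(q - q) * advantages.unsqueeze(1)).sum()
 grad_plain, = torch.autograd.grad(loss_plain, q)

 print(grad_detached)  # each row equals that sample's advantage
 print(grad_plain)     # all zeros
</code></pre>So the detached version keeps the forward value of the exp term at 1 while its gradient carries the advantage weighting, which is presumably why dropping the detach() stops learning.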