I found the paper a tad difficult to understand because it spends a lot of time circling around the main thesis instead of describing it directly. So, to the best of my understanding:

We want to improve LLMs' ability to give correct answers to hard problems. One theory is that we can do that by training a "self-correcting" behavior into the models, where they take a wrong answer as input and improve it into a better/correct answer.

This has been explored previously, trying to train the behavior with various reinforcement learning techniques where the reward is based on how good the "corrected" answer is. So far it hasn't worked well, and the trained behavior doesn't generalize.

The thesis of the paper is that this is because when the model is presented with a training example of `Answer 1, Reasoning, Corrected Answer` and a signal of "make Corrected Answer better", it actually has _two_ perfectly viable ways to do that. One is to improve `Reasoning, Corrected Answer`, which would yield a higher reward and is what we want. The other, just as valid, solution is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`.

The latter is what existing research has shown happens, and why attempts to train the desired behavior have failed so far. The models just learn to improve their answers, not their correcting behavior. This paper's solution is to change the training regimen slightly to encourage the model toward the former approach, and thus, hopefully, actually train the desired behavior of correcting previous answers.

This is done with two stages of training. In the first stage, the model is forced (by a KL divergence penalty) to keep its first answers the same, while being rewarded for improving the second answer. Keeping the model's distribution of initial answers unchanged avoids the later issue where the model no longer sees many "wrong" answers because wrong answers were trained out of it, while still initializing the "self-correcting" behavior.

In the second stage the model is free to change the first answer, but they tweak the reward function to give higher rewards for "flips" (where Answer 1 was bad but Answer 2 was good). So in this second stage it can use both strategies, improving its first answer or improving its self-correction, but it gets more reward for the latter. This seems to be a kind of refinement pass, improving things overall while keeping the self-correcting behavior intact.

Anyway, blah blah blah, metrics showing the technique working better and generalizing better.

Seems reasonable to me. I'd be a bit worried that, in Stage 2, the model learns to write _worse_ answers for Answer 1 so it can maximize the reward for flipping them. So you'd need some kind of balancing to ensure Answer 1 doesn't get worse. Not sure if that's in their reward function, or if it's even a valid concern in practice.
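
For concreteness, here's how I'd sketch the two reward stages in Python. To be clear, this is just my reading of the summary above, not the paper's actual formulation: the verifier stub, the `beta` KL coefficient, and the `flip_bonus` value are all made up.

```python
# Rough sketch of the two-stage reward as I understand it. Everything
# here (function names, the verifier stub, beta, flip_bonus) is my own
# illustration, not lifted from the paper.

def is_correct(answer: str, gold: str) -> float:
    """Stand-in verifier: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def stage1_reward(second_ans: str, gold: str,
                  kl_first_attempt: float, beta: float = 0.1) -> float:
    # Stage 1: only the corrected (second) answer earns reward, while a
    # KL penalty keeps the distribution over first answers close to the
    # base model, so "just fix Answer 1" isn't a viable shortcut.
    return is_correct(second_ans, gold) - beta * kl_first_attempt

def stage2_reward(first_ans: str, second_ans: str, gold: str,
                  flip_bonus: float = 0.5) -> float:
    # Stage 2: both answers count, plus a bonus for a wrong -> right
    # "flip", so genuine self-correction is worth more than simply
    # producing the same answer twice in a row.
    r1 = is_correct(first_ans, gold)
    r2 = is_correct(second_ans, gold)
    bonus = flip_bonus if (r1 == 0.0 and r2 == 1.0) else 0.0
    return r1 + r2 + bonus
```

Written that way, a correct Answer 1 still pays (2.0 for getting both right beats 1.5 for a flip), which is the kind of balancing I was wondering about above, but I don't know whether their actual shaping looks anything like this.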