The essence of the article is that self-correction already exists as a nascent ability in base models (more robustly in some, like Qwen, than in others). This is highly reminiscent of Chain of Thought, which was likewise found to be a capability already present in base models. What RL does is reinforce the authentic self-correction patterns that are already there and down-weight superficial self-correction.

Thoughts:

- An analogy you shouldn't zoom in on too closely: going from CoT to reasoning traces is like going from purely ballistic trajectories to trajectories with navigation and thrusters. RL is for learning how to use the thrusters to make adjustments, based on the model's internal encodings of rare samples† where some author fully spelled out their thought process.

- This might also explain why SFT on reasoner traces seems surprisingly effective. If self-correction were a purely RL-mediated phenomenon, SFT for reasoning would not work nearly as well.

- DeepSeek struggled to get RL to work on smaller models. If that is replicated, it may be that larger models encode self-correction patterns more robustly and assign them higher probability.

- For smaller models, imitating traces is an easier way than pure RL to bring such patterns to the fore. However, we still want models to learn how to dynamically adjust their thrusters, and SFT does not provide ample opportunity for that. Further training with RL, or alternatively replacing SFT with methods like [Critique Fine-Tuning](https://arxiv.org/abs/2501.17703), is needed (the first sketch at the end of this comment illustrates the contrast between the two kinds of update).

- The article incidentally reinforces that a low temperature buys consistency, not correctness. Outside high-confidence scenarios, the greedily decoded highest-probability answer is often not among the best answers the model can give (the second sketch below makes this concrete with toy numbers).

†Question: my first thought for such samples is blogs by people discussing what didn't work. But I wonder how much of reasoning models' patterns and ability is shaped by Detective Conan transcripts?
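On the SFT-vs-RL point, here is a minimal toy sketch, not the article's or DeepSeek's actual recipe: an SFT step imitates a fixed teacher trace token by token, while an RL step samples the model's own trace and reinforces it only if the outcome checks out. The tiny GRU policy, the toy vocabulary with its `<fix>` token, and the reward function are all hypothetical, and plain REINFORCE stands in for whatever RL algorithm is actually used.

```python
# Toy contrast between SFT on a fixed trace and a REINFORCE-style RL update on
# self-sampled traces. Vocabulary, reward, and model are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = ["<bos>", "step", "<fix>", "wrong", "right", "<eos>"]
GOOD_ANSWER = VOCAB.index("right")

class TinyPolicy(nn.Module):
    """Minimal autoregressive policy: embedding -> GRU -> next-token logits."""
    def __init__(self, vocab_size=len(VOCAB), dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # tokens: (B, T) int64
        h, _ = self.rnn(self.emb(tokens))
        return self.head(h)                         # (B, T, V) logits

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# --- SFT: imitate one fixed teacher trace token-by-token (cross-entropy). ----
teacher = torch.tensor([[0, 1, 3, 2, 4, 5]])        # <bos> step wrong <fix> right <eos>
logits = policy(teacher[:, :-1])
sft_loss = F.cross_entropy(logits.reshape(-1, len(VOCAB)), teacher[:, 1:].reshape(-1))
opt.zero_grad(); sft_loss.backward(); opt.step()

# --- RL: sample a trace, score only the outcome, reinforce what worked. ------
def sample_trace(policy, max_len=6):
    tokens = torch.tensor([[0]])                    # start at <bos>
    logps = []
    for _ in range(max_len - 1):
        step_logits = policy(tokens)[:, -1, :]
        dist = torch.distributions.Categorical(logits=step_logits)
        tok = dist.sample()
        logps.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(1)], dim=1)
    return tokens, torch.stack(logps).sum()

tokens, logp = sample_trace(policy)
# Reward the outcome (did the trace reach the right answer?), not the surface
# form of the trace -- this is what lets the model's own corrections get reinforced.
reward = float(GOOD_ANSWER in tokens[0].tolist())
rl_loss = -(reward - 0.5) * logp                    # REINFORCE with a constant baseline
opt.zero_grad(); rl_loss.backward(); opt.step()
```

The point of the contrast is that only the second kind of update rewards self-correction moves the model found on its own, rather than moves copied from a fixed trace.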
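And on the temperature point, a toy numerical sketch under an invented answer distribution: the single most probable answer is wrong, yet most of the probability mass sits on correct answers, so greedy decoding is consistently wrong while sampling at higher temperature (especially best-of-n with any external check) does better. All numbers are made up for illustration.

```python
# Toy illustration of "low temperature buys consistency, not correctness".
import numpy as np

rng = np.random.default_rng(0)
answers = ["wrong-but-most-likely", "correct-A", "correct-B", "correct-C"]
logits = np.array([2.0, 1.7, 1.6, 1.5])        # hypothetical model scores
is_correct = np.array([0, 1, 1, 1])

def sample(temperature, n=1):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return rng.choice(len(answers), size=n, p=p)

greedy = int(np.argmax(logits))                # temperature -> 0: always the same pick
print("greedy answer:", answers[greedy], "correct:", bool(is_correct[greedy]))

for temp in (0.3, 1.0):
    draws = sample(temp, n=1000)
    print(f"T={temp}: accuracy of single samples = {is_correct[draws].mean():.2f}")

# Best-of-n with any external check (verifier, tests, majority vote) benefits
# from the spread that a higher temperature provides.
best_of_8 = [is_correct[sample(1.0, n=8)].max() for _ in range(1000)]
print("best-of-8 at T=1.0 solves:", np.mean(best_of_8))
```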