They're distilling a reasoning model, using a Llama model as the base, but they're using RL instead of SFT:<p><pre><code> Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.
</code></pre>
I'm curious:<p>1. How do they determine 'closely aligned'?<p>2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
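For what it's worth, my guess at (1) is that "closely aligned" is operationalized as a reward signal rather than a similarity metric: score sampled responses, then do a REINFORCE-style update that shifts probability mass toward the high-scoring ones. That's speculation on my part, not something the excerpt states; the score() function, the constant baseline, and the toy categorical "policy" below are all placeholders I made up to keep the sketch self-contained:<p><pre><code> # Speculative sketch, not the paper's actual recipe: a REINFORCE-style
 # update toward responses a scorer rates highly. The "policy" is a
 # categorical distribution over canned responses, just to show the shape
 # of the update without a real LLM.
 import torch

 torch.manual_seed(0)

 responses = [
     "final answer only",
     "step-by-step reasoning, then the final answer",
     "off-topic rambling",
 ]

 def score(response: str) -> float:
     # Stand-in for whatever scoring identified the "high-scoring
     # responses" in the previous stage (reward model, verifier, etc.).
     return 1.0 if "reasoning" in response else 0.1

 logits = torch.zeros(len(responses), requires_grad=True)  # toy "policy"
 opt = torch.optim.Adam([logits], lr=0.1)

 for _ in range(300):
     dist = torch.distributions.Categorical(logits=logits)
     idx = dist.sample()
     reward = score(responses[idx])
     baseline = 0.5                                    # crude constant baseline
     loss = -(reward - baseline) * dist.log_prob(idx)  # REINFORCE
     opt.zero_grad()
     loss.backward()
     opt.step()

 print(torch.softmax(logits, dim=0))
 # Most of the probability mass ends up on the reasoning-style response.
</code></pre>
<p>If that reading is right, the obvious SFT counterpart is plain cross-entropy on the top-scoring responses (rejection-sampling-style fine-tuning), which is exactly why I'd like to see the head-to-head comparison in (2) on the same base model and data.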