They're distilling a reasoning model, using a Llama model as a base. But they're using RL instead of SFT:

    Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.

I'm curious:

1. How do they determine "closely aligned"? (One possible reading is sketched below.)

2. How does the performance of this RL approach compare with SFT using the same base model and the same dataset?
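For what it's worth, here is a minimal sketch of one reading of that sentence: reward each sampled response by how similar it is to a high-scoring reference from the previous stage, and apply a REINFORCE-style policy-gradient update. Everything here (ToyPolicy, similarity_reward, the token-overlap reward) is my own illustrative stand-in, not anything from the paper:

    import torch
    import torch.nn as nn

    class ToyPolicy(nn.Module):
        """Stand-in for the LLM: just a learnable table of per-token logits."""
        def __init__(self, vocab_size=16, seq_len=8):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(seq_len, vocab_size))

        def forward(self):
            return self.logits

    def similarity_reward(sampled, reference):
        """Toy 'closely aligned' score: fraction of tokens matching the reference."""
        return (sampled == reference).float().mean().item()

    policy = ToyPolicy()
    optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)
    reference = torch.randint(0, 16, (8,))  # a high-scoring response from the previous stage

    for step in range(200):
        dist = torch.distributions.Categorical(logits=policy())
        sampled = dist.sample()                        # sample a candidate response
        reward = similarity_reward(sampled, reference) # how "closely aligned" is it?
        loss = -dist.log_prob(sampled).sum() * reward  # scale log-likelihood by reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The SFT comparison in question 2 would just swap that loop body for a cross-entropy loss against the same reference, so an apples-to-apples ablation seems cheap to run.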
I'm curious how they evaluate the responses in the first place. This is the part that replaces human annotation (which seems to be the cornerstone of their method), yet no detail is provided about it.
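Pure speculation on my part, but that scoring step sounds like an LLM-as-judge setup, something along these lines (call_judge_llm and the rubric are hypothetical stand-ins, not from the paper):

    from typing import Callable, List, Tuple

    RUBRIC = (
        "Rate the following response from 1 to 10 for correctness and reasoning quality. "
        "Reply with only the number.\n\nQuestion: {question}\n\nResponse: {response}"
    )

    def rank_responses(
        question: str,
        candidates: List[str],
        call_judge_llm: Callable[[str], str],
    ) -> List[Tuple[float, str]]:
        """Score each candidate with the judge prompt and return them best-first."""
        scored = []
        for response in candidates:
            raw = call_judge_llm(RUBRIC.format(question=question, response=response))
            try:
                score = float(raw.strip())
            except ValueError:
                score = 0.0  # an unparsable judgement counts as zero
            scored.append((score, response))
        return sorted(scored, key=lambda pair: pair[0], reverse=True)

    # Usage with a dummy judge that always answers "7":
    best = rank_responses("What is 2+2?", ["4", "5"], lambda prompt: "7")

If that's roughly what they do, the judge model and its rubric would matter as much as the RL stage, which is why the missing detail stands out.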