This is a really exciting development. They’re matching Qwen 2.5 32B on a third of the compute budget.<p>> Refined post-training and RLVR: Our models integrate our latest breakthrough in reinforcement learning with verifiable rewards (RLVR) as part of the Tülu 3.1 recipe by using Group Relative Policy Optimization (GRPO) and improved training infrastructure further enhancing their capabilities.<p>I only recently discovered the work AI2 published with Tülu 3, which lays out all of the components that make up a state-of-the-art post-training data mix. Very interesting stuff!<p><a href="https://allenai.org/blog/tulu-3-technical" rel="nofollow">https://allenai.org/blog/tulu-3-technical</a>