While interesting, the title is obviously a bit misleading.

> Our results on a temporally held-out test set of questions resolving after December 25, 2024 show that for both of the models that we employed our method on, Phi-4 14B [15] and DeepSeek-R1 14B [14], we find accuracy improvements of between 7–10% over the base versions of these models as well as the same models fine-tuned with randomized outcome labels as a control

So a 7–10% improvement for small models like DeepSeek-R1-Distill-Qwen-14B and Phi-4-14B, approaching GPT-4o.

It would be interesting if the same holds for DeepSeek-R1-Distill-Qwen-32B, which in my experience is far superior to DeepSeek-R1-Distill-Qwen-14B in almost every way, yet is still runnable without datacenter-class GPUs.

The ridge plots of Brier scores are probably a good hint as to whether your application can benefit, based on its tail dependence.

IMHO this paper is all about making small models work better, and nothing in it suggests anything about frontier models or LLMs in general.
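
For anyone skimming: each point behind those ridge plots is just a per-question Brier score, i.e. the squared difference between the model's predicted probability and the resolved outcome (0 = perfect, 1 = worst). A minimal sketch with made-up numbers, not taken from the paper:

    # Hypothetical forecasts and resolved outcomes, purely illustrative.
    forecasts = [0.9, 0.2, 0.65, 0.05]   # predicted probabilities of "yes"
    outcomes  = [1,   0,   1,    0]      # how the questions actually resolved

    # Per-question Brier scores; the ridge plots show the distribution of these.
    briers = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    mean_brier = sum(briers) / len(briers)   # lower is better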