I guess I'm bearish?

It's not that they *trained a new model*; it's that they *took an existing model* and RL'd it a bit?

The scores are very close to QwQ-32B, and at the end:

"Overall, as QwQ-32B was already extensively trained with RL, it was difficult to obtain huge amounts of generalized improvement on benchmarks beyond our improvements on the training dataset. To see stronger improvements, it is likely that better base models such as the now available Qwen3, or higher quality datasets and RL environments are needed."