> we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256).

This is a weak argument. I think I get what they're trying to say, but let's take it to the extreme, say pass@10^10^100. Just as a group of monkeys could write Shakespeare given enough time, a completely random model could probably outperform an RL-trained model at pass@10^10^100. Would we then say the random model can reason too?

Of course the correct reasoning trace will be in the base model's distribution, just like any other well-formed, coherent paragraph. Kind of makes me think: maybe sampling efficiency _is_ intelligence?
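To make the limiting behavior concrete, here's a minimal sketch (mine, not from the paper) using the standard unbiased pass@k estimator from Chen et al. (2021); the per-sample success probabilities below are illustrative assumptions, not measured numbers:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator (Chen et al., 2021):
        # 1 - C(n-c, k) / C(n, k), computed in a numerically stable form,
        # given n samples of which c passed.
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Under i.i.d. sampling with per-sample success probability p,
    # pass@k = 1 - (1 - p)^k, so any p > 0 pushes pass@k toward 1
    # once k is large enough: the monkeys-typing-Shakespeare effect.
    for p, k in [(0.5, 1), (0.5, 256), (1e-6, 1), (1e-6, 10**7)]:
        print(f"p={p:g}, k={k}: pass@k ~ {1 - (1 - p) ** k:.5f}")

Even a model that's right only one time in a million saturates pass@k once k is on the order of 10^7, which is why comparisons at ever-larger k eventually stop saying anything about reasoning ability.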