I don't like papers that ask a question in the title, so here's the answer:<p>"RL boosts sampling efficiency but reduces the reasoning capacity boundary."<p>Perhaps better to put it like this: Given one or a few attempts, RL-trained models beat non-RL models. Given many attempts, non-RL models come up with better answers.
They write "We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses." but the example answer they show at the end only gets the correct number due to two errors canceling out. The model calculates 195+367+562+900 and gets 1924 instead of 2024, and also turns -437 - 2*234 into -805 instead of -905, but in total 1924-805 = 2024-905 = 1119 and from there the remaining steps are correct again.<p>It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.
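Spelling the cancellation out (numbers copied from the example above):

```python
# The model's two arithmetic slips, next to the correct values.
correct_sum = 195 + 367 + 562 + 900   # 2024
model_sum   = 1924                    # model's (wrong) total

correct_sub = -437 - 2 * 234          # -905
model_sub   = -805                    # model's (wrong) value

# Both slips are off by exactly 100 in opposite directions, so they cancel:
print(correct_sum + correct_sub)      # 1119
print(model_sum + model_sub)          # 1119
```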
> we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256).<p>This is a weak argument. I think I get what they are trying to say, but let's take this to the extreme, say pass@10^10^100. Just like a group of monkeys could write Shakespeare if given enough time, a completely random model could probably outperform an RL-trained model at pass@10^10^100. Would we then say the random model can reason too?<p>Of course the correct reasoning trace will be in the base model's distribution, just like any other well-formed, coherent paragraph. Kind of makes me think, maybe sampling efficiency _is_ intelligence?
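For context, pass@k in these papers is usually computed with the unbiased estimator from the Codex paper: with n sampled generations of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n generations is correct, given c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 256 generations, 3 of them correct.
print(pass_at_k(256, 3, 1))    # ~0.0117
print(pass_at_k(256, 3, 256))  # 1.0
```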
RL constrains the space of possible output token sequences to what is likely to lead to the correct answer. So we are inherently making a trade-off to reduce variance. A non-RL model will have higher variance, so given enough attempts, it will come up with some correct answers that an RL model can't.
I'm a bit skeptical of this until it's proven that they're getting the right answers in the right ways. It could be that base models are just more random, and when given 200 guesses out of 1000 possible answers they tend to distribute them more evenly, bringing up the pass@k number.
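That intuition is easy to put numbers on. Assuming, purely for illustration, a pool of 1000 plausible answers and uniform random guessing, blind guessing already gets a respectable pass@200:

```python
# Chance that at least one of k uniform guesses over m candidate answers
# hits the single correct one: 1 - (1 - 1/m)**k
m = 1000
for k in (1, 16, 64, 200, 256):
    p = 1 - (1 - 1 / m) ** k
    print(f"pass@{k} ~ {p:.3f}")
# pass@1 ~ 0.001, pass@200 ~ 0.181, pass@256 ~ 0.226
```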
‘Crucially, all correct solutions from RL-trained models already exist in the base model's distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.’ — wouldn't any kind of RL fail to converge, or even progress at all, if the solution couldn't be found in the base model's distribution? The way training is set up, the models absolutely need to be able to find correct solutions in a reasonable time, otherwise there wouldn't be any training signal.
>Our key finding is that all reasoning paths in the RLVR model are already present in the base model.<p>This is a really good observation. It means that you don't need to RL the full model. You merely need to RL a few LoRAs or maybe a small Mamba model appended to the final layer.
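If that holds, the minimal version would be to freeze the base weights and let the RL objective update only low-rank adapters. A rough PyTorch sketch of one such adapter layer (rank and shapes are made up for illustration, not taken from the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update.
    Only A and B would receive RL gradients; the base weights stay fixed."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze the base model
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)        # small init; B stays zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Wrap e.g. the attention/MLP projections of the policy model, then run the
# usual RLVR loop optimizing only the adapter parameters.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 adapter params vs ~16.8M in the full layer
```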
They fix the temperature at T=0.6 for all k and all models, even though their own Figure 10 shows that the RL model benefits from higher temperatures. I would buy the overall claim much more if they swept the temperature parameter for each k and model, like they did in the Codex paper [1].<p>[1] <a href="https://arxiv.org/abs/2107.03374" rel="nofollow">https://arxiv.org/abs/2107.03374</a>
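A sweep like that could be as simple as picking, for each k, the best temperature on a small grid. A sketch (the generation and answer-checking callables are placeholders, not anything from the paper):

```python
from math import comb

def pass_at_k(n, c, k):
    # Same unbiased estimator as in the Codex paper.
    return 1.0 if n - c < k else 1.0 - comb(n - c, k) / comb(n, k)

def sweep_temperature(generate, is_correct, ks=(1, 16, 64, 256),
                      temps=(0.2, 0.6, 1.0, 1.2), n=256):
    """For each k, report the best pass@k over a temperature grid.
    `generate(n, temperature)` returns n samples; `is_correct(sample)`
    checks one of them. Both are supplied by the caller."""
    best = {k: 0.0 for k in ks}
    for t in temps:
        samples = generate(n, t)
        c = sum(is_correct(s) for s in samples)
        for k in ks:
            best[k] = max(best[k], pass_at_k(n, c, k))
    return best
```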
This 100% tracks with my experience.<p>Also fun stuff many don't know: if you run a regular model's chat template with a reasoning-tuned model, it can go back to acting like the base model, with no "thinking" process.<p>"Reasoning" models are not any better than non-reasoning models. It's a parlor trick, and the benchmarks that claimed otherwise are bad.
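If anyone wants to try that, here is a minimal sketch with Hugging Face transformers, assuming the reasoning-tuned checkpoint and its non-reasoning sibling share a tokenizer (the model ids below are placeholders, not real repos):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

reasoning_id = "some-org/reasoning-model"   # placeholder: reasoning-tuned checkpoint
plain_id     = "some-org/instruct-model"    # placeholder: non-reasoning sibling

model = AutoModelForCausalLM.from_pretrained(reasoning_id)
tok   = AutoTokenizer.from_pretrained(plain_id)  # note: the *other* model's chat template

messages = [{"role": "user", "content": "What is 37 * 43?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```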
Reasoning models aren't really reasoners; it's basically neural style transfer, where you force the model's decoder to emit tokens in a style that looks like reasoning, i.e. deductive thinking.
If you don't know the answer to a problem, you're not going to be able to repeat sampling until it is correct. Random strings will saturate all benchmarks at k=infinity if tested this way.