From the DeepSeek paper: they did try, but found that the model would learn to cheat the judge. It doesn't seem impossible, but it's probably a serious challenge.

> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
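
To make "cheating the judge" concrete, here's a toy sketch (my own illustration, not DeepSeek's setup; reward_model, optimize, and the length bonus are all invented): if a learned judge has any exploitable proxy feature, a policy optimized against it will climb that feature instead of the true objective.

    import random

    # Toy "judge": a learned reward model with a spurious correlation.
    # Suppose it was fit on data where good answers tended to be long,
    # so it partly rewards length instead of correctness.
    def reward_model(answer: str, reference: str) -> float:
        correctness = 1.0 if reference in answer else 0.0
        length_bonus = min(len(answer) / 200.0, 1.0)  # exploitable proxy
        return 0.5 * correctness + 0.5 * length_bonus

    # Naive "policy optimization": hill-climb on the judge's score by
    # mutating the answer, with no access to the true objective.
    def optimize(reference: str, steps: int = 1000) -> str:
        answer = "42"
        best = reward_model(answer, reference)
        for _ in range(steps):
            candidate = answer + random.choice([" indeed", " clearly", " moreover"])
            score = reward_model(candidate, reference)
            if score > best:
                answer, best = candidate, score
        return answer

    if __name__ == "__main__":
        hacked = optimize(reference="Paris")
        print(hacked)  # pure filler: the optimizer maxed the length bonus
        print(reward_model(hacked, "Paris"))  # ~0.5 with zero correctness

The optimizer never produces a correct answer; it just saturates the length bonus. That's the basic failure mode, and it's presumably why the paper sticks to rule-based rewards (answer checking, format checking) that are much harder to game.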