From the DeepSeek paper: they did try, but found that the model would learn to cheat the judge. It doesn't seem impossible, but it's probably a serious challenge.

> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
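
To make "cheating the judge" concrete, here's a toy sketch (my own illustration, not DeepSeek's setup; reward_model, optimize, and the length bonus are all invented): if a learned judge has any exploitable proxy feature, a policy optimized against it will climb that feature instead of the true objective.

    import random

    # Toy "judge": a learned reward model with a spurious correlation.
    # Suppose it was fit on data where good answers tended to be long,
    # so it partly rewards length instead of correctness.
    def reward_model(answer: str, reference: str) -> float:
        correctness = 1.0 if reference in answer else 0.0
        length_bonus = min(len(answer) / 200.0, 1.0)  # exploitable proxy
        return 0.5 * correctness + 0.5 * length_bonus

    # Naive "policy optimization": hill-climb on the judge's score by
    # mutating the answer, with no access to the true objective.
    def optimize(reference: str, steps: int = 1000) -> str:
        answer = "42"
        best = reward_model(answer, reference)
        for _ in range(steps):
            candidate = answer + random.choice([" indeed", " clearly", " moreover"])
            score = reward_model(candidate, reference)
            if score > best:
                answer, best = candidate, score
        return answer

    if __name__ == "__main__":
        hacked = optimize(reference="Paris")
        print(hacked)  # pure filler: the optimizer maxed the length bonus
        print(reward_model(hacked, "Paris"))  # ~0.5 with zero correctness

The optimizer never produces a correct answer; it just saturates the length bonus. That's the basic failure mode, and it's presumably why the paper sticks to rule-based rewards (answer checking, format checking) that are much harder to game.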