I admit I'm a bit confused by the reward function. As given, it seems to provide the same score regardless of correctness, because of the squaring? And I think even if that's a mistake and the reward is supposed to be negative for incorrect answers, the policy that optimizes that reward is to output 1 for anything with less than a 50% chance of being true and 10 for anything over 50%. Is that how RL is typically done?
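
To make the second point concrete, here is a rough sketch of what I mean. I'm assuming the "fixed" reward is +c² for a correct answer and -c² for an incorrect one, with the confidence c restricted to the integers 1 through 10; that reading, and the names `expected_reward` and `best_confidence`, are my own and not taken from the original setup.

```python
def expected_reward(p, c):
    """Expected reward for reporting confidence c when the answer is true with probability p.

    Assumption (mine, not from the original): reward is +c**2 if correct, -c**2 if incorrect.
    """
    return p * c**2 - (1 - p) * c**2  # simplifies to (2p - 1) * c**2


def best_confidence(p, levels=range(1, 11)):
    """Return the confidence level in `levels` that maximizes expected reward."""
    return max(levels, key=lambda c: expected_reward(p, c))


if __name__ == "__main__":
    for p in (0.1, 0.4, 0.6, 0.9):
        print(p, best_confidence(p))
```

Under that assumption the expected reward is (2p - 1)·c², so its sign depends only on whether p is above or below 0.5. The best c is then always an extreme: the smallest value when p < 0.5 and the largest when p > 0.5, which never reflects how confident the policy actually is.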