Excellent background on knowledge calibration from Anthropic: https://arxiv.org/abs/2207.05221

"Calibration" in a knowledge context means having estimated_p(correct) ~ p(correct), and it turns out that LLMs are reasonably good at this. Also a core reason why LLM-as-a-judge works so well: quality evaluation is vastly easier than generation.
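For concreteness, calibration in this sense is usually measured by binning predictions by stated confidence and comparing each bin's average confidence to its empirical accuracy (expected calibration error). A minimal sketch, with made-up confidence/correctness data purely for illustration:

```python
# Minimal sketch of measuring calibration via expected calibration error (ECE).
# The confidences/correct arrays are illustrative, not real evaluation data.
import numpy as np

confidences = np.array([0.95, 0.80, 0.65, 0.90, 0.55, 0.70, 0.85, 0.60])  # model's stated p(correct)
correct     = np.array([1,    1,    0,    1,    1,    0,    1,    0])     # whether the answer was right

def expected_calibration_error(conf, right, n_bins=5):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - right[mask].mean())  # |avg confidence - accuracy| in this bin
            ece += mask.mean() * gap                           # weight by fraction of samples in the bin
    return ece

print(expected_calibration_error(confidences, correct))
# A well-calibrated model has estimated_p(correct) ~ p(correct), so ECE is close to 0.
```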
(Not this paper's approach.)

This is one of the key prompting patterns in a lot of enterprise cases: you can prompt LLMs to return a confidence score along with their responses, especially when using them for downstream NLP tasks.

The confidence score can also be a useful signal for a two-tier model approach.
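A minimal sketch of that two-tier idea: ask the cheaper model for an answer plus a self-reported confidence, and escalate to a stronger model only when confidence is low. The call_llm function, model names, JSON format, and the 0.7 threshold are all hypothetical placeholders, not any particular API:

```python
# Sketch of two-tier routing on a prompted confidence score.
# call_llm, the model names, and the threshold are placeholders to adapt to your stack.
import json

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for whatever LLM client you actually use; returns the raw model text."""
    raise NotImplementedError

def answer_with_confidence(question: str, threshold: float = 0.7) -> dict:
    prompt = (
        "Answer the question, then rate your confidence from 0 to 1.\n"
        'Respond as JSON: {"answer": "...", "confidence": 0.0}\n\n'
        f"Question: {question}"
    )
    cheap = json.loads(call_llm("small-model", prompt))
    if cheap["confidence"] >= threshold:        # confident enough: keep the cheap tier's answer
        return cheap
    return json.loads(call_llm("large-model", prompt))  # otherwise escalate to the stronger tier
```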
I admit I'm a bit confused by the reward function: as given, it seems to provide the same score regardless of correctness because of the squaring? And even if that's a mistake and it's supposed to be negative for incorrect answers, the reward-maximizing policy would be to output 1 for anything with less than a 50% chance of being true and 10 for anything over 50%. Is that how RL is typically done?
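To spell out that second point under one reading of the reward (my assumption, not necessarily the paper's actual scheme): if you report a confidence c in 1..10 and get +c when correct and -c when wrong, then E[reward] = p*c - (1-p)*c = (2p-1)*c, which is maximized by c = 10 whenever p > 0.5 and by c = 1 whenever p < 0.5:

```python
# Expected reward under the assumed scheme: +c if correct, -c if wrong, c in 1..10.
# This reflects my reading of the reward, not necessarily the paper's definition.
def expected_reward(p_correct: float, c: int) -> float:
    return p_correct * c - (1 - p_correct) * c  # = (2p - 1) * c

for p in (0.3, 0.6, 0.9):
    best_c = max(range(1, 11), key=lambda c: expected_reward(p, c))
    print(p, best_c)  # -> best confidence is 1 when p < 0.5, 10 when p > 0.5
```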