Discussions about LLM alignment often overlook data quality and quantity. Current models like Llama 2 use 10K+ prompts and responses for supervised fine-tuning (SFT) and 100K+ human preference pairs. While preferences are relatively easy to annotate, producing a good SFT dataset is hard.<p><a href="https://evalovernite.substack.com/p/rlhf-math-aint-enough" rel="nofollow noreferrer">https://evalovernite.substack.com/p/rlhf-math-aint-enough</a><p><a href="https://doi.org/10.5281/zenodo.8186168" rel="nofollow noreferrer">https://doi.org/10.5281/zenodo.8186168</a>