Discussions about LLM alignment often overlook data quality and quantity. Current models like Llama 2 use 10K+ prompts and responses for supervised fine-tuning (SFT) and 100K+ human preference pairs. While the preferences are relatively easy to annotate, producing a good SFT dataset is not.<p><a href="https://evalovernite.substack.com/p/rlhf-math-aint-enough" rel="nofollow noreferrer">https://evalovernite.substack.com/p/rlhf-math-aint-enough</a><p><a href="https://doi.org/10.5281/zenodo.8186168" rel="nofollow noreferrer">https://doi.org/10.5281/zenodo.8186168</a>
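<p>The effort gap comes from what the annotator has to produce in each case: an SFT example needs a fully human-written demonstration, while a preference pair only asks which of two model outputs is better. A rough sketch of the two data shapes (field names here are illustrative, not the actual Llama 2 schema):<p>
    # SFT example: the annotator must author the entire target response.
    sft_example = {
        "prompt": "Explain what supervised fine-tuning is.",
        "response": "Supervised fine-tuning trains the model on human-written demonstrations...",
    }

    # Preference pair: the annotator only picks the better of two model completions.
    preference_pair = {
        "prompt": "Explain what supervised fine-tuning is.",
        "chosen": "completion judged better",
        "rejected": "completion judged worse",
    }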
I read here that Yann LeCun has claimed that even with RLHF, LLMs will still hallucinate - that it's an unavoidable consequence of their autoregressive nature.<p><a href="https://www.hopsworks.ai/dictionary/rlhf-reinforcement-learning-from-human-feedback" rel="nofollow noreferrer">https://www.hopsworks.ai/dictionary/rlhf-reinforcement-learn...</a>