For me, the paper contained many gems. Besides the superficial alignment hypothesis and its consequences for the fine-tuning dataset, Figure 7 about instruction alignment vs. conversation alignment and Figure 9 about the positive correlation of perplexity with the quality score (i.e. a negative correlation between perplexity-based model quality and response-based quality) were very insightful.

What I missed: how does the superficial alignment hypothesis relate to model size? They only investigate disjoint aspects on the 7B vs. 65B LLaMA models. Since the paper focuses on data quality, I would also have expected an annotation guideline.

Still, I think the paper is an excellent read.
I think they have a mistake in their analysis:

> "B Anticorrelation between Perplexity and Generation Quality"

> "When fine-tuning LIMA, we observe that perplexity on held-out Stack Exchange data (2,000 examples) negatively correlates with the model’s ability to produce quality responses. To quantify this manual observation, we evaluate model generations using ChatGPT, following the methodology described in Section 5. Figure 9 shows that as perplexity rises with more training steps – which is typically a negative sign that the model is overfitting – so does the quality of generations increase"

I think where they say "anticorrelation" it should say "correlation", and where they say "negatively correlates" it should say "positively correlates", if they are basing their statement on what they observed in their experiments.

EDIT: I see they say "Preprint. Under review", so maybe they will fix it if it's wrong. This is the kind of thing that peer review is really good at fixing. Also, not every submission on arXiv is a preprint or under review, but I guess this one is.
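A quick sanity check of which word fits: compute the correlation between perplexity at each checkpoint and the generation-quality score. A minimal sketch (the numbers here are made up for illustration, not taken from the paper):

```python
# Illustrative only: sign of the correlation between held-out perplexity and
# a judged generation-quality score across fine-tuning checkpoints.
# Both lists are hypothetical; the paper's actual values are in its Figure 9.
from scipy.stats import pearsonr

checkpoint_perplexity = [5.2, 5.6, 6.1, 6.8, 7.4]   # rises with more training steps
generation_quality    = [3.1, 3.4, 3.9, 4.2, 4.5]   # rises as well

r, p = pearsonr(checkpoint_perplexity, generation_quality)
print(f"Pearson r = {r:.2f}")  # r > 0: the two quantities move together
```

If both quantities go up together, as they describe, that's a positive correlation, not an anticorrelation.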
This is such an interesting direction for LLM research (especially because it's easy to imagine applicability in industry as well).

If all it takes is ~1k high-quality examples (of course, quality can be tricky to define) to tune an LLM successfully, then we should expect to see these tuned LLMs for many different narrow use cases.

Of course, the devil is likely in the details. Even in this paper, the prompts on which the model is evaluated were written by the authors and "inspired by their own interests or those of their friends." It can be tricky to make the jump from these prompts and answers to real-world LLM use cases, but super, super promising.
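To give a sense of how simple such a narrow fine-tune could be in practice, here is a rough sketch of supervised fine-tuning a base model on a small curated prompt/response set. The model name, data file, and hyperparameters are assumptions for illustration, not the paper's exact recipe:

```python
# Minimal sketch (not the paper's recipe): supervised fine-tuning of a base causal LM
# on ~1k curated prompt/response pairs. Model name, file path, and hyperparameters
# below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "huggyllama/llama-7b"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token     # LLaMA tokenizers have no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Expect a JSONL file with one {"prompt": ..., "response": ...} object per line (~1k rows).
data = load_dataset("json", data_files="curated_1k.jsonl")["train"]

def to_text(ex):
    # Concatenate prompt and response into a single training sequence.
    return {"text": ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token}

def tokenize(ex):
    return tokenizer(ex["text"], truncation=True, max_length=2048)

data = data.map(to_text)
data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="small-sft", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```

The interesting part is less the training loop than the data: curating and quality-controlling those ~1k examples is where the real work (and the paper's contribution) lies.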
I went in expecting progress on "alignment" as in "how to make sure AI doesn't kill us all" and I saw nothing at all about that in the paper. Disappointing.

Using the term "alignment" for what they're trying to do is misleading.
Seems interesting, as it runs counter to the "common knowledge" that fine-tuning large LMs needs a lot of data and RLHF for good results.

Not that the absolute results are extremely strong (most likely, I suspect, because the base model just isn't competitive with GPT-4 at the moment), but the relative results seem very impactful. Maybe fine-tuning a large LM for specific tasks is more practical than previously thought?