
LIMA: Less Is More for Alignment

27 points by lebek almost 2 years ago

6 comments

dball9 almost 2 years ago
For me, the paper contained many gems. Besides the superficial alignment hypothesis and its consequences for the fine-tuning dataset, Figure 7 on instruction alignment vs. conversation alignment and Figure 9 on the positive correlation of the perplexity number with the quality score (i.e., negative correlation of the perplexity-based model quality with the response-based quality) were very insightful.

What I missed: how does the superficial alignment hypothesis relate to model size? (They only investigate disjoint aspects on the 7B vs. 65B LLaMA models.) Since the paper focuses on data quality, I would also have expected an annotation guideline.

Still, I think the paper is an excellent read.
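If you want to poke at the perplexity-vs-quality observation yourself, here is a minimal sketch of computing held-out perplexity with a causal LM via Hugging Face transformers. The model name and the text are stand-ins, not the paper's setup (LIMA fine-tunes LLaMA checkpoints):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; swap in whatever checkpoint you are evaluating
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def perplexity(text: str) -> float:
        enc = tok(text, return_tensors="pt")
        with torch.no_grad():
            # labels=input_ids makes the model return the mean
            # next-token cross-entropy loss over the sequence
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    # lower = the model finds the held-out text less surprising
    print(perplexity("The quick brown fox jumps over the lazy dog."))
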
ftxbro almost 2 years ago
I think they have some mistake in their analysis:

> "B Anticorrelation between Perplexity and Generation Quality"

> "When fine-tuning LIMA, we observe that perplexity on held-out Stack Exchange data (2,000 examples) negatively correlates with the model's ability to produce quality responses. To quantify this manual observation, we evaluate model generations using ChatGPT, following the methodology described in Section 5. Figure 9 shows that as perplexity rises with more training steps – which is typically a negative sign that the model is overfitting – so does the quality of generations increase"

I think where they say "anticorrelation" it should say "correlation", and where they say "negatively correlates" it should say "positively correlates", if they are basing their statement on what they observed in their experiments.

EDIT: I see they say "Preprint. Under review", so maybe they will fix it if it's wrong. This is the kind of thing that peer review is really good at fixing. Also, not every submission on arXiv is a preprint or under review, but I guess this one is.
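To make the sign convention concrete, here is a toy check with made-up numbers (not the paper's data): if the quality score rises while perplexity also rises, Pearson's r comes out positive, i.e. the two are correlated, not anticorrelated.

    from scipy.stats import pearsonr

    # hypothetical per-checkpoint numbers, purely illustrative
    ppl     = [5.2, 5.6, 6.1, 6.8, 7.4]  # rises with training steps
    quality = [3.1, 3.5, 4.0, 4.4, 4.9]  # judged quality score, also rises

    r, p = pearsonr(ppl, quality)
    print(f"r = {r:.2f}")  # positive r => correlation, not anticorrelation
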
rish-b almost 2 years ago
This is such an interesting direction for LLM research (especially because it's easy to imagine applicability in industry as well).

If all it takes is ~1k high-quality examples (of course, quality can be tricky to define) to tune an LLM successfully, then we should expect to see these tuned LLMs for many different narrow use cases; a minimal sketch of that recipe is below.

Of course, the devil is likely in the details. Even in this paper, the prompts on which the model is evaluated were written by the authors and "inspired by their own interests or those of their friends." It can be tricky to make the jump from these prompts and answers to real-world LLM use cases, but it's super promising.
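As a rough illustration of that recipe (small, curated supervised fine-tuning, no RLHF), here is a minimal sketch using the Hugging Face Trainer. The model name, the example pair, and all hyperparameters are placeholders rather than the paper's setup:

    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "gpt2"  # stand-in for a LLaMA-class base model
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # imagine ~1k carefully curated prompt/response pairs here
    pairs = [{"prompt": "How do I boil an egg?",
              "response": "Cover the egg with cold water, bring to a boil..."}]

    def to_features(ex):
        text = ex["prompt"] + "\n" + ex["response"] + tok.eos_token
        return tok(text, truncation=True, max_length=512)

    ds = Dataset.from_list(pairs).map(to_features,
                                      remove_columns=["prompt", "response"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="lima-style-sft",
                               num_train_epochs=3,
                               per_device_train_batch_size=2,
                               learning_rate=1e-5),
        train_dataset=ds,
        # mlm=False gives plain causal-LM (next-token) labels
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()
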
csense almost 2 years ago
I went in expecting progress on "alignment" as in "how to make sure AI doesn't kill us all", and I saw nothing at all about that in the paper. Disappointing.

Using the term "alignment" for what they're trying to do is misleading.
mk67 almost 2 years ago
Seems interesting, as it runs counter to the "common knowledge" that fine-tuning large LMs needs a lot of data and RLHF for good results.

Not that the absolute results are extremely strong (most likely, I'd suspect, because the base model is just not competitive with GPT-4 at the moment), but the relative results seem very impactful. Maybe fine-tuning a large LM for specific tasks is more practical than previously thought?
ftxbro almost 2 years ago
I wish they would say more about their training setup, like how many GPUs of which kind, and for how long.