I think smaller expert models will dominate the majority of applications. There is a fine balance to strike between size and usability, and there will be many mechanisms like the one demonstrated in the post for finding that optimum and realizing it.
Interesting that they use T5 for the distilled model. I was under the impression that encoder-decoder architectures were going the way of the dodo, but it seems they may still be relevant after all.<p>Also interesting is that this isn't an inconceivably clever, out-of-the-box idea. It shows there's still a lot of low-hanging fruit to explore, and the future of LLMs isn't set in stone yet. It could be that the real deal is a mixture of experts trained in this style. It's exciting that it feels like the holy grail is close to being achievable if only the right combination of ideas is tried.
The amount of activity and progress in the LLM/ML/AI space is truly fantastic. Optimizations like this are particularly valuable when hardware (e.g. Nvidia GPUs) is so expensive.
So this <a href="https://arxiv.org/abs/2212.08410" rel="nofollow noreferrer">https://arxiv.org/abs/2212.08410</a> but one year later
I am not a researcher, but it always seemed intuitive to me that the most effective models would be multimodal and trained on a carefully tailored core curriculum.<p>I would want to ensure that the system gains and retains the fundamental structures and skills it needs to generalize effectively and accurately. While maintaining those, you then feed it lots of diverse data so it learns the exceptions and the ways the skills can be combined. But somehow you need to preserve that core knowledge and skill set throughout. Maybe you could do that just by having it output those understandings or manipulations in addition to the final answer, similar to what the paper does; a rough sketch of that multi-task idea is below.<p>For example, a code generation model might be required to output a state machine simulation of the requested program.
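That "output the intermediate understanding too" idea is essentially the paper's multi-task setup: the student is trained to produce both the final answer and the teacher's rationale. A minimal sketch with Hugging Face T5, where the task prefixes and loss weight are illustrative choices of mine, not necessarily the paper's exact values:<p><pre><code>import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Sammy wanted to go to where the people are. Where might he go?"
answer = "populated areas"
rationale = "People gather in populated areas."  # would come from the teacher LLM

def seq2seq_loss(prefix: str, source: str, target: str) -> torch.Tensor:
    """Standard seq2seq cross-entropy loss for one (input, output) pair."""
    inputs = tok(prefix + source, return_tensors="pt")
    labels = tok(target, return_tensors="pt").input_ids
    return model(**inputs, labels=labels).loss

# Train on both tasks at once: answer prediction plus rationale generation.
lam = 0.5  # rationale-loss weight (a hyperparameter, value illustrative)
loss = seq2seq_loss("[label] ", question, answer) \
     + lam * seq2seq_loss("[rationale] ", question, rationale)
loss.backward()  # one multi-task training step</code></pre>The auxiliary output acts as a regularizer: the model can't shortcut to the answer without also representing the reasoning behind it.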
Why is the amount of training data for the LLM less than for the distilled and task-specific models (in the first figure)?<p>Or did the authors count the LLMs' training data toward the required training data for the distilled/task-specific models?<p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeIs4yaBA3Ir55j869FMzdmRdf7OxiIjsWl05GU48ikYOHZGLk1H8tIHeKKBaY_xER0QITv5DUhADZvqS1os6mNA_nLQKqwW7DOXnwcnPl6BhsMJ_LKTvglGUrHR5_QC8MIe3K7i9zyfcWkwzvjPhXLifYijgkeeG_1yn9EMm-ol9eI9Cv_rz71wMyGfk2/s1570/image3.png" rel="nofollow noreferrer">https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj...</a>
Is it that a lot of capacity is unused in those behemoth LLMs, or that the smaller language model just mimics the reasoning task? (Mimics the mimicking?)
Interesting! Do you think RLHF would be a necessity for smaller models to perform on par with state-of-the-art LLMs? In my view, instruction tuning will resolve any issues related to output structure, tonality, or domain understanding, but will it be enough to improve the reasoning capabilities of the smaller model?
> For instance, serving a single 175 billion LLM requires at least 350GB of GPU memory using specialized infrastructure<p>Apple ships the Mac Studio, which supports up to 144GB of usable GPU memory.<p>It would be amusing if they were to release a Mac Pro with 300+ GB and dominate the LLM-serving space.
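For scale: the 350GB figure is just the weights at 16-bit precision. A back-of-the-envelope sketch (my own numbers, ignoring KV cache, activations, and framework overhead):<p><pre><code># Rough GPU memory needed just to hold model weights.
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    # (params_billion * 1e9 params) * bytes / (1e9 bytes per GB)
    return params_billion * bytes_per_param

for precision, bytes_pp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"175B weights @ {precision}: ~{weight_memory_gb(175, bytes_pp):.0f} GB")

# fp16 -> ~350 GB (the article's figure)
# int8 -> ~175 GB (still over a 144GB Mac Studio)
# int4 -> ~88 GB  (fits, with some quality loss)</code></pre>Which is why aggressive quantization gets so much attention: it can be the difference between fitting on a single high-memory machine or not.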
<i>> given the input question “Sammy wanted to go to where the people are. Where might he go? Answer Choices: (a) populated areas, (b) race track, (c) desert, (d) apartment, (e) roadblock”, distilling step-by-step provides the correct answer to the question, “(a) populated areas”</i><p>Huh? My answer as a human would have been "race track", as that is probably "where the people are" (during a race).<p>Did I fail? Am I a poor language model? Or is the whole thing just tea leaf reading to begin with?