I've been increasingly wondering if the field treating LLMs as a continuum, as opposed to a set of distinct thresholds, is leading to erroneous "rules of thumb," since most methodology research is concentrated on smaller and more accessible models right now.<p>We generally recognize (nearly ad nauseam) that mouse models in medical research don't necessarily translate to humans.<p>Similarly, I'd imagine most would laugh at the idea that a neurology researcher who found the best way to get a fruit fly's brain to navigate a maze should extrapolate that methodology to a dolphin's or a chimp's brain.<p>Maybe we should be defining "weight classes" for LLMs and grouping research by those classes. Like "these are the techniques that work best for lightweight models," but without assuming they hold as a general rule of thumb for "heavyweight models."<p>Even the discussion of synthetic data and model collapse is a good example of where there might be a very significant difference in the effect on model quality between a cheaper, less sophisticated model generating synthetic data to feed back into itself and a much more complex and sophisticated model doing the same. Maybe the lesson is actually "recursive training on synthetic data leads to model collapse <i>in lightweight and medium weight models</i>."<p>So while the writeup is a great one on fine-tuning 7B models with LoRA, I'd be curious what % of the recommendations hold up in replication for even just a 65B model.
This is an exceptionally useful article. A few highlights:<p>* QLoRA works really well compared to LoRA if you need to save memory (at the cost of time)<p>* For small LoRAs, Adam has almost no memory usage penalty compared to SGD<p>* Multiple training epochs lower performance (!). To quote: "This performance decline is likely due to increased overfitting, which warrants additional investigation." (Note that this is LoRA overfitting, and it's unclear which layers it was enabled for in this experiment.)<p>* The best results for the alpha and r parameters in LoRA seem to come from setting alpha = 2r<p>* Dataset quality beats quantity: 1k LIMA examples give better results than 50k Alpaca examples
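The alpha = 2r point is easier to see in the LoRA update rule itself. Here's a minimal NumPy sketch (hypothetical shapes, not the article's code): the adapted weight is W' = W + (alpha / r) * B A, so alpha = 2r just fixes the update's scaling factor at 2 no matter which rank you pick.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 8, 8, 4, 8  # alpha = 2r

W = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

scaling = alpha / r                     # 2.0 whenever alpha = 2r
W_adapted = W + scaling * (B @ A)

# With B initialized to zero, the adapter starts out as an exact no-op.
assert np.allclose(W_adapted, W)
```

One nice property of this scheme: because the scaling is alpha / r, doubling r while keeping alpha = 2r leaves the effective magnitude of the update unchanged, so the rank can be tuned somewhat independently of the learning rate.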
LoRA blew me away the first time I looked into it. Especially since you can host many LoRA adapters at once for a fraction of the cost of hosting an entire model by sharing the base between the adapters. I built a little tool to make LoRA fine-tuning easier. The adapters export to Huggingface. You can check it out here: <a href="https://app.haven.run">https://app.haven.run</a>
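The economics of serving many adapters over one base can be sketched in a few lines of NumPy (toy sizes and adapter names are mine, not the tool's): the big base weight is stored once, and each adapter only adds two small low-rank matrices applied as a correction per request.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 1024, 8, 16

# Shared base weight (~1M params here), stored once for all adapters.
W = rng.standard_normal((d, d))

# Each adapter is just a small (A, B) pair: 2*d*r params each.
adapters = {
    name: (rng.standard_normal((r, d)) * 0.01,   # A: down-projection
           rng.standard_normal((d, r)) * 0.01)   # B: up-projection
    for name in ("adapter_code", "adapter_chat")
}

def forward(x, adapter=None):
    y = W @ x                                    # base computation, shared
    if adapter is not None:
        A, B = adapters[adapter]
        y = y + (alpha / r) * (B @ (A @ x))      # cheap low-rank correction
    return y

x = rng.standard_normal(d)
y_code = forward(x, "adapter_code")
y_chat = forward(x, "adapter_chat")

# Each extra adapter costs 2*d*r params vs d*d for the base: ~1.6% here.
extra = 2 * d * r / (d * d)
```

So adding a tenth adapter to a running server costs roughly the same marginal memory as the second one did, which is why per-adapter hosting can be so much cheaper than per-model hosting.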
I fine-tuned LLama-2 on code/comment generation (in Python) for around $2 and was able to run it natively on an M1 MacBook Air. I can totally see smaller fine-tuned LLMs being used locally on consumer devices in the future. I think people underestimate how cheap and efficient this stuff is.<p>I've actually built a service which lets you fine-tune LLama-2 and other LLMs by uploading a JSON dataset. I'm looking for feedback; the link is <a href="https://useftn.com" rel="nofollow noreferrer">https://useftn.com</a>.
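For anyone curious what "uploading a JSON dataset" typically looks like: the exact schema this service expects isn't stated in the comment, but a common instruction-tuning layout is the Alpaca-style list of instruction/input/output records. A small sketch, including a sanity check before upload:

```python
import json

# Alpaca-style records; the specific field names a given service expects
# are an assumption here, so check its docs before uploading.
dataset = [
    {
        "instruction": "Write a docstring for this function.",
        "input": "def add(a, b):\n    return a + b",
        "output": '"""Return the sum of a and b."""',
    },
]

def validate(records):
    """Raise if any record is missing one of the expected string fields."""
    required = {"instruction", "input", "output"}
    for i, rec in enumerate(records):
        missing = required - rec.keys()
        if missing:
            raise ValueError(f"record {i} missing fields: {sorted(missing)}")
    return True

validate(dataset)
payload = json.dumps(dataset, indent=2)  # the file you'd actually upload
```

Validating locally like this is worth the two minutes: a malformed record usually surfaces as a confusing failure halfway through a paid fine-tuning run rather than at upload time.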
I’ve been thinking for a while about ways to compress information with AI for long-distance transmission over LoRa radio, and now this LoRA in the news gets me all confused.
What is the toolset that works best?<p>axolotl is generally recommended... but I'm unsure whether it's genuinely the best for production-scale fine-tuning.
Ever since the author paywalled some of his useful posts, I stopped following him. I've read his ML book, and I know he used to be a professor and now works in industry; he's quite famous in the field. That's why I don't understand why such a figure would even need the extra income from Substack's paywall.