Great work, lots of useful information here.

The only thing I wish you had explored differently is alpha > 2*r.

In this blog post, the author found that an alpha of 4*r (where r=64) outperformed all smaller alphas in terms of loss when finetuning Llama-7b on databricks-dolly-15k:

https://medium.com/@drishtisharma96505/comparative-analysis-of-lora-parameters-on-llama-2-with-flash-attention-574b913295d4

Additionally, you identify r=16 (with alpha = 2*r) as inferior to r=256; however, aside from arithmetic, r=16 actually outperforms all the others. And the base model outperforms any finetuned variant on both arithmetic metrics.
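For readers wondering what alpha actually changes: in LoRA the low-rank update is scaled by alpha/r, so alpha = 4*r simply doubles the adapter's contribution relative to alpha = 2*r. A minimal sketch of that scaling (my own naming and defaults, not the article's code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update scaled by alpha/r.

    Illustrative sketch only; names and init choices are assumptions,
    not the article's implementation.
    """
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scaling = alpha / r                  # alpha = 2*r -> scale 2; alpha = 4*r -> scale 4

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```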
It seemed to take terribly performing models and marginally improve them. None of them are fit for purpose at the end of training, so what was the point?
I'd like to see a writeup of LoRA applied to something that is readily tractable without it, e.g. a pretrained ResNet34 ImageNet model that gets a LoRA adapter instead of being fine-tuned or fully retrained. The pedagogical value is that it could be compared against the alternatives, which are tractable in that setting (and which are not in an LLM setting).
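To make the suggestion concrete, here is a rough sketch of what LoRA-style adapters on a pretrained torchvision ResNet34 could look like; the choice of wrapping only the 3x3 convs in the last stage (plus a fresh classifier head) is my own illustrative wiring, not anything from the article:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34, ResNet34_Weights

class LoRAConv2d(nn.Module):
    """Frozen conv layer plus a low-rank (r-channel bottleneck) residual branch."""
    def __init__(self, base: nn.Conv2d, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Down-project to r channels with the same kernel/stride/padding, then 1x1 up-project.
        self.down = nn.Conv2d(base.in_channels, r, kernel_size=base.kernel_size,
                              stride=base.stride, padding=base.padding, bias=False)
        self.up = nn.Conv2d(r, base.out_channels, kernel_size=1, bias=False)
        nn.init.kaiming_uniform_(self.down.weight)
        nn.init.zeros_(self.up.weight)            # adapter starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.up(self.down(x))

model = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad_(False)

# Wrap the 3x3 convs in the last residual stage; train only the adapters and the head.
for _, block in model.layer4.named_children():
    block.conv1 = LoRAConv2d(block.conv1)
    block.conv2 = LoRAConv2d(block.conv2)
model.fc = nn.Linear(model.fc.in_features, 1000)  # or a new head for the target task

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable, "trainable parameters")
```

In this setting you could run full fine-tuning, linear probing, and the LoRA variant side by side on the same dataset, which is exactly the comparison that is too expensive to do exhaustively with an LLM.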
LoRAs are, like most fine-tuning, a spectrum.

LoRAs can be nearly the same size as the original model, with nearly the same representation capacity/trainability, or they can be a tiny, tiny fraction of the original model size, with correspondingly fewer learnable parameters.

As such, they are suitable for nearly all tasks. We should be asking if they are better than regular fine-tuning, or soft prompts (aka textual inversion), or slapping new trainable layers on the end (aka hypernetworks). The stable diffusion community seems to think that they are.
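To put numbers on that spectrum: for a single d_out x d_in weight matrix, a rank-r LoRA adds r*(d_in + d_out) parameters, so the rank directly sets where you land between "tiny fraction" and "nearly the full layer". A quick back-of-the-envelope sketch (the 4096x4096 shape is just an illustrative Llama-like projection size, not a figure from the article):

```python
# LoRA parameter count for one d_out x d_in weight matrix: r * (d_in + d_out).
d_in = d_out = 4096                      # illustrative Llama-2-7B-like projection size
full = d_in * d_out                      # ~16.8M parameters in the frozen weight

for r in (8, 64, 256, 2048):
    lora = r * (d_in + d_out)
    print(f"r={r:5d}: {lora / 1e6:6.2f}M adapter params "
          f"({100 * lora / full:5.1f}% of the full matrix)")
```

At r=8 the adapter is under 0.4% of the matrix; at r=2048 it matches the full parameter count, which is the "nearly the same size as the original model" end of the spectrum.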
This is amazing, thank you!!

> My hypothesis is that the Alpaca dataset does not contain any related arithmetic tasks, and the model actively unlearns basic arithmetic when it focuses more on other tasks.

I'm surprised this wasn't verified; it's a major benchmark stat. My eyes keep getting drawn to it, because it seems to have the most variance. Does anyone know?

Also, throwing it out there: I would love to see a Neptune/W&B dashboard of the hyperparameter tuning :)
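The hypothesis itself looks cheap to spot-check by scanning the Alpaca instructions for arithmetic-looking expressions. A rough sketch, assuming the tatsu-lab/alpaca copy on the Hugging Face Hub and a deliberately naive regex, so treat any count it prints as indicative only:

```python
import re
from datasets import load_dataset  # pip install datasets

# Naive pattern: two numbers joined by an arithmetic operator, e.g. "12 + 7" or "3*4".
ARITH = re.compile(r"\d+\s*[-+*/x×]\s*\d+")

ds = load_dataset("tatsu-lab/alpaca", split="train")
hits = [ex for ex in ds if ARITH.search(ex["instruction"] + " " + ex["input"])]

print(f"{len(hits)} of {len(ds)} examples contain arithmetic-looking expressions")
for ex in hits[:5]:
    print("-", ex["instruction"][:80])
```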
Claude 2 TL;DR:
Here are a few key points I gathered from the article:

The article explores optimizing LoRA hyperparameter settings for finetuning large language models. The goal is to maximize performance while minimizing memory usage and training time. The base model is Llama-2 7B. Experiments compare default LoRA, QLoRA (4-bit quantized), the AdamW and SGD optimizers, and different choices for the rank r and alpha hyperparameters.

Key findings:

- QLoRA provides substantial memory savings (6GB less than default LoRA) at the cost of slower training. The performance impact is minor.
- AdamW vs. SGD makes little difference in memory or performance.
- Increasing training iterations from 50k to 100k hurts performance, likely because the Alpaca dataset lacks diversity.
- Tuning the rank r and alpha is the most impactful lever. A good rule of thumb is to set alpha = 2*r.
- The best model uses r=256, alpha=512. It improves over the base model on most tasks, except arithmetic.

The optimized LoRA model was submitted to the NeurIPS efficiency challenge and showed improvements on several benchmarks compared to the base Llama-2 model.

Takeaways are practical tips for tuning LoRA hyperparameters and trading off memory, compute, and model performance.
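For anyone who wants to try the summarized recipe (QLoRA with r=256, alpha=512) without reading the full article, here is a rough sketch using Hugging Face transformers/peft/bitsandbytes; the article may well use a different training stack, and the target_modules list and dropout here are my own assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # gated repo; requires accepting the license
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters with the summarized settings: r=256, alpha = 2*r = 512.
lora_config = LoraConfig(
    r=256,
    lora_alpha=512,
    lora_dropout=0.05,                                         # assumption, not from the article
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption, not from the article
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```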
Good article, but my god, why is there a roughly 2 second delay before every user interaction on this page? Scrolling, selecting text, everything has some kind of 2 second "bootup" time, after which things work normally, but if you stop interacting for a bit it goes back into some idle mode. Really weird.