Claude 2 TL;DR:
Here are a few key points I gathered from the article:

The article explores optimizing LoRA hyperparameter settings for finetuning large language models.
The goal is to maximize performance while minimizing memory usage and training time.

The base model is Llama-2 7B. Experiments compare default LoRA, QLoRA (4-bit quantized), AdamW vs. SGD optimizers, and different choices of the rank r and alpha hyperparameters.

Key findings:

- QLoRA provides substantial memory savings (about 6 GB less than default LoRA) at the cost of slower training; the impact on model performance is minor.

- AdamW vs. SGD makes little difference in either memory use or performance.

- Increasing training iterations from 50k to 100k hurts performance, likely because the Alpaca dataset lacks diversity.

- Tuning the rank r and alpha is the most impactful change; a good rule of thumb is to set alpha = 2*r.
The best model uses r=256 and alpha=512. It improves over the base model on most tasks, except arithmetic.

The optimized LoRA model was submitted to the NeurIPS efficiency challenge and showed improvements on several benchmarks compared to the base Llama-2 model.

The takeaways are practical tips for tuning LoRA hyperparameters and for trading off memory, compute, and model performance.
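For anyone who wants to try these settings, here is a minimal sketch of an r=256, alpha=512 QLoRA setup using the Hugging Face Transformers/PEFT/bitsandbytes stack; the article itself works in a different framework, so the checkpoint name, target_modules, and dropout below are illustrative assumptions rather than the author's exact configuration:

    # Sketch: Llama-2 7B loaded with 4-bit quantized weights (QLoRA) plus
    # LoRA adapters using the article's best hyperparameters (r=256, alpha=2*r=512).
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                       # 4-bit base weights: the ~6 GB memory saving noted above
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",              # assumed checkpoint name
        quantization_config=bnb_config,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=256,                                   # LoRA rank
        lora_alpha=512,                          # rule of thumb: alpha = 2 * r
        lora_dropout=0.05,                       # assumed; not specified in this summary
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()           # only the adapter weights are trainable

Training on top of this would then use AdamW or SGD; per the article, the optimizer choice matters far less than the choice of r and alpha.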