This is an exceptionally useful article. A few highlights:

* QLoRA works really well compared to LoRA if you need to save memory (at the cost of training time).

* For small LoRAs, Adam has almost no memory penalty compared to SGD.

* Multiple training epochs lower performance (!). To quote: "This performance decline is likely due to increased overfitting, which warrants additional investigation." (Note that this is LoRA overfitting, and it's unclear which layers LoRA was enabled for in this experiment.)

* The best combination of the alpha and r parameters in LoRA seems to be alpha = 2r (see the sketch after this list).

* Dataset quality beats quantity: 1k LIMA examples give better results than 50k Alpaca examples.
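
For reference, here's a minimal sketch of what the QLoRA + alpha = 2r setup looks like with the Hugging Face peft/bitsandbytes stack. This isn't from the article; the model name, rank, and target modules are illustrative placeholders:

```python
# Sketch: 4-bit (QLoRA-style) base model plus a LoRA config using alpha = 2r.
# Model name, r, and target_modules are illustrative, not from the article.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # QLoRA: quantize frozen base weights to 4 bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # placeholder model
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

r = 16                                    # adapter rank; illustrative value
lora_config = LoraConfig(
    r=r,
    lora_alpha=2 * r,                     # the alpha = 2r heuristic from the article
    target_modules=["q_proj", "v_proj"],  # which layers get adapters (illustrative)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the small adapter matrices train
```

The base weights stay frozen and quantized; only the rank-r adapter matrices get gradients, which is why the optimizer choice (Adam vs SGD) barely moves the memory needle for small r.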