The other recent improvement suggested for LoRA is DoRA: <a href="https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch" rel="nofollow">https://magazine.sebastianraschka.com/p/lora-and-dora-from-s...</a>. It really does seem to strongly outperform LoRA - see also <a href="https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html" rel="nofollow">https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.htm...</a>
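For context, the core idea in those posts is to decompose each pretrained weight into a magnitude and a direction, apply the LoRA update only to the direction, and learn the magnitude separately. Here's a rough sketch of that idea in PyTorch (illustrative only; the class name and hyperparameters are mine, not taken from the linked posts):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DoRALinear(nn.Module):
        """Sketch of a DoRA-style layer: frozen base weight, LoRA update
        applied to the direction, learned per-column magnitude."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
            self.bias = (nn.Parameter(base.bias.detach().clone(), requires_grad=False)
                         if base.bias is not None else None)
            out_f, in_f = self.weight.shape
            self.lora_A = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # small random init
            self.lora_B = nn.Parameter(torch.zeros(out_f, rank))        # zero init => no change at start
            self.scaling = alpha / rank
            # magnitude starts as the column-wise norm of the pretrained weight
            self.magnitude = nn.Parameter(self.weight.norm(p=2, dim=0, keepdim=True))

        def forward(self, x):
            combined = self.weight + self.scaling * (self.lora_B @ self.lora_A)
            # normalize columns to get the direction, then rescale by the learned magnitude
            direction = combined / combined.norm(p=2, dim=0, keepdim=True)
            return F.linear(x, self.magnitude * direction, self.bias)

The extra trainable state over plain LoRA is just the magnitude vector plus the normalization in the forward pass, which is presumably why it drops into QLoRA-style tooling so easily, as in the second link.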
I’m struggling to understand from this paper whether the approach is better in the general sense (all cases, with wider models seeing greater benefits) or only for wider models (with narrower models seeing a detriment)?

If it’s the former, this could effectively halve finetuning cost overnight, which would go a significant way towards enabling a wider array of use cases for LoRA.
I've had success with GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection <a href="https://arxiv.org/abs/2403.03507" rel="nofollow">https://arxiv.org/abs/2403.03507</a>
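The trick there, as I understand it, is that the weights stay full-rank but the gradients (and the Adam moments) get projected into a low-rank subspace that is refreshed periodically via SVD. Roughly, for a single 2D weight (a sketch of the idea only, not the paper's reference implementation; hyperparameter names here are illustrative):

    import torch

    def galore_adam_step(weight, grad, state, lr=1e-4, rank=4,
                         update_proj_gap=200, scale=0.25,
                         beta1=0.9, beta2=0.999, eps=1e-8):
        # One GaLore-style step for a single 2D weight: Adam moments
        # live in a rank-r subspace of the gradient (sketch only).
        step = state.get("step", 0)
        if "P" not in state or step % update_proj_gap == 0:
            # refresh the projection from the top-r left singular vectors of the gradient
            U, _, _ = torch.linalg.svd(grad, full_matrices=False)
            state["P"] = U[:, :rank]                                   # (m, r)
            state.setdefault("m1", grad.new_zeros(rank, grad.shape[1]))
            state.setdefault("m2", grad.new_zeros(rank, grad.shape[1]))
        P = state["P"]
        g_low = P.T @ grad                                             # (r, n) projected gradient
        # standard Adam moments, but on the small projected gradient
        state["m1"] = beta1 * state["m1"] + (1 - beta1) * g_low
        state["m2"] = beta2 * state["m2"] + (1 - beta2) * g_low.pow(2)
        m_hat = state["m1"] / (1 - beta1 ** (step + 1))
        v_hat = state["m2"] / (1 - beta2 ** (step + 1))
        # project the normalized update back to full size and apply it
        weight.data.add_(P @ (m_hat / (v_hat.sqrt() + eps)), alpha=-lr * scale)
        state["step"] = step + 1

Memory-wise the win in this sketch is that the optimizer state shrinks from two m-by-n moment matrices to two r-by-n ones plus an m-by-r projection.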
What an unfortunate name... I initially thought this was about wireless communication. <a href="https://en.wikipedia.org/wiki/LoRa" rel="nofollow">https://en.wikipedia.org/wiki/LoRa</a>