Nice article. I'm not in this field, but my understanding of the original paper was that LoRA was applied only to the last dense layer, not to every linear layer independently (maybe I misread it originally).

Digging a bit into why the implementation in the link is structured this way, I found that QLoRA does the same thing and that it seems to have some interesting effects, so adding a note on the QLoRA decision would be nice :)

I'm still not sure I understand why it works, though. My neophyte view was that applying LoRA to the last layer made sense, but I can't wrap my head around the rationale for applying it to every linear layer. Can someone explain their intuition?
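
To make the question concrete, here is roughly what I understand "LoRA on every linear layer" to mean; this is a toy PyTorch sketch under my own assumptions, not the article's actual code, and the class/function names are mine:

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen base linear plus a trainable low-rank update: W x + (alpha/r) * B(A x)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)      # pretrained weight stays frozen
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection
            self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection
            nn.init.zeros_(self.B.weight)               # adapter starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * self.B(self.A(x))

    def add_lora_everywhere(model: nn.Module, r: int = 8, alpha: int = 16):
        """Wrap every nn.Linear in the model, i.e. adapt all linear layers, not just the last one."""
        for name, child in model.named_children():
            if isinstance(child, nn.Linear):
                setattr(model, name, LoRALinear(child, r, alpha))
            else:
                add_lora_everywhere(child, r, alpha)
        return model

If that is roughly right, then the question stands: why does adding these small low-rank updates at every linear layer help, rather than just adapting the final layer?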