Dao and Gu show that if you simplify Mamba so its state-space layer uses a diagonal matrix A_t that is a scalar times the identity, i.e., A_t = a_t I, the state-space transformation can be expressed as a form of causal linear attention [a] by compounding the coefficients a_1 ... a_t at each time step t. The equivalence of this simplified state-space layer and causal linear attention constitutes the duality the authors refer to in the title. By taking advantage of this duality, Mamba-2 can be trained more efficiently, i.e., faster than the original Mamba on GPUs. (A minimal numerical sketch of the equivalence is at the end of this comment.)

Theoretical stuff aside, Mamba-2's performance seems to scale slightly better than the original Mamba's: https://tridao.me/assets/img/2024-05-31-mamba-2/pile_8k_mamba2-1400.webp

Here's the code implementing Mamba-2: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py

Great work by Tri Dao (of FlashAttention fame) and Albert Gu, as usual.

The key question, for me and many others, is whether Mamba, Mamba-2, RWKV, and other linear RNN / linear attention models will ever match the performance of standard softmax attention. My understanding and experience is that all the linear attention models out there [b] still underperform softmax attention on things like recall tasks. [c]

---

[a] https://arxiv.org/abs/2006.16236

[b] https://github.com/topics/linear-attention / https://github.com/topics/linear-attention-model -- this list is by no means complete!

[c] https://arxiv.org/abs/2402.01032
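
To make the first paragraph concrete, here is a minimal numerical sketch of the duality. It is not the authors' implementation: the variable names, the tiny dimensions, and the single scalar input channel are my own simplifications. It checks that the scalar-identity recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t gives the same outputs as multiplying x by a causal "attention" matrix whose (t, s) entry is (a_{s+1} ... a_t)(C_t . B_s):

    import numpy as np

    rng = np.random.default_rng(0)
    T, N = 6, 4                         # sequence length, state dimension
    a = rng.uniform(0.5, 1.0, size=T)   # scalar decay a_t (A_t = a_t * I)
    B = rng.standard_normal((T, N))     # input projections B_t
    C = rng.standard_normal((T, N))     # output projections C_t
    x = rng.standard_normal(T)          # one input channel

    # Recurrent (SSM) view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
    h = np.zeros(N)
    y_rec = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y_rec[t] = C[t] @ h

    # "Attention" view: y = (L * (C @ B.T)) @ x,
    # where L[t, s] = a_{s+1} * ... * a_t for t >= s (empty product = 1), else 0
    L = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            L[t, s] = np.prod(a[s + 1 : t + 1])
    M = L * (C @ B.T)                   # causal linear-attention-style matrix
    y_att = M @ x

    assert np.allclose(y_rec, y_att)

The lower-triangular matrix L of cumulative products is the structured (semiseparable) mask the paper builds on; the point of the sketch is just that the same map can be computed either as a recurrence or as one masked matrix multiply.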