Dao and Gu show that if you simplify Mamba so its state-space layer uses a diagonal matrix A_t that is a scalar times the identity, i.e., A_t = a_t I, the state-space transformation can be expressed as a form of causal linear attention [a] by compounding the coefficients a_1 ... a_t at each time step t. The equivalence of this simplified state-space layer and causal linear attention constitutes the duality the authors refer to in the title. By taking advantage of this duality, Mamba-2 can be trained more efficiently, i.e., faster than the original Mamba on GPUs. (A minimal numerical sketch of the equivalence is at the end of this comment.)

Theoretical stuff aside, Mamba-2's performance seems to scale slightly better than the original Mamba's: https://tridao.me/assets/img/2024-05-31-mamba-2/pile_8k_mamba2-1400.webp

Here's the code implementing Mamba-2: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py

Great work by Tri Dao (of FlashAttention fame) and Albert Gu, as usual.

The key question, for me and many others, is whether Mamba, Mamba-2, RWKV, and other linear RNN / linear attention models will ever match the performance of standard softmax attention. My understanding and experience is that all the linear attention models out there [b] still underperform softmax attention on things like recall tasks. [c]

---

[a] https://arxiv.org/abs/2006.16236

[b] https://github.com/topics/linear-attention / https://github.com/topics/linear-attention-model -- this list is by no means complete!

[c] https://arxiv.org/abs/2402.01032
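
To make the first paragraph concrete, here is a minimal numerical sketch of the duality. It is not the authors' implementation: the variable names, the tiny dimensions, and the single scalar input channel are my own simplifications. It checks that the scalar-identity recurrence h_t = a_t h_{t-1} + B_t x_t, y_t = C_t . h_t gives the same outputs as multiplying x by a causal "attention" matrix whose (t, s) entry is (a_{s+1} ... a_t)(C_t . B_s):

    import numpy as np

    rng = np.random.default_rng(0)
    T, N = 6, 4                         # sequence length, state dimension
    a = rng.uniform(0.5, 1.0, size=T)   # scalar decay a_t (A_t = a_t * I)
    B = rng.standard_normal((T, N))     # input projections B_t
    C = rng.standard_normal((T, N))     # output projections C_t
    x = rng.standard_normal(T)          # one input channel

    # Recurrent (SSM) view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
    h = np.zeros(N)
    y_rec = np.zeros(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y_rec[t] = C[t] @ h

    # "Attention" view: y = (L * (C @ B.T)) @ x,
    # where L[t, s] = a_{s+1} * ... * a_t for t >= s (empty product = 1), else 0
    L = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            L[t, s] = np.prod(a[s + 1 : t + 1])
    M = L * (C @ B.T)                   # causal linear-attention-style matrix
    y_att = M @ x

    assert np.allclose(y_rec, y_att)

The lower-triangular matrix L of cumulative products is the structured (semiseparable) mask the paper builds on; the point of the sketch is just that the same map can be computed either as a recurrence or as one masked matrix multiply.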