
Mamba-2 – State Space Duality

151 points by bratao 12 months ago

8 comments

cs702 12 months ago
Dao and Gu show that if you simplify Mamba so its state-space layer uses a diagonal matrix A_t that is a scalar times the identity matrix, i.e., A_t = a_t I, the state-space transformation can be expressed as a form of causal linear attention [a] by compounding the coefficients a_1 ... a_t at each time step t. The equivalence of the simplified state-space layer and causal linear attention constitutes the duality the authors refer to in the title. By taking advantage of this duality, Mamba-2 can be trained more efficiently, i.e., faster than the original Mamba on GPUs.

Theoretical stuff aside, Mamba-2's performance seems to scale slightly better than the original Mamba's: https://tridao.me/assets/img/2024-05-31-mamba-2/pile_8k_mamba2-1400.webp

Here's the code implementing Mamba-2: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py

Great work by Tri Dao (of FlashAttention fame) and Albert Gu, as usual.

The key question, for me and many others, is whether Mamba, Mamba-2, RWKV, and other linear RNN / linear attention models will ever match the performance of standard Softmax attention. My understanding and experience is that all the linear attention models out there [b] still underperform Softmax attention on things like recall tasks. [c]

---

[a] https://arxiv.org/abs/2006.16236

[b] https://github.com/topics/linear-attention / https://github.com/topics/linear-attention-model -- this list is by no means complete!

[c] https://arxiv.org/abs/2402.01032
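
A minimal numerical sketch of the duality described above (my own illustration, not the authors' implementation), assuming scalar decay A_t = a_t I, scalar inputs, and made-up variable names; it checks that the recurrent SSM view and the masked linear-attention view produce identical outputs:

    import numpy as np

    T, N = 6, 4                       # sequence length, state dimension (illustrative)
    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, T)      # per-step scalar decay, i.e. A_t = a_t * I
    B = rng.normal(size=(T, N))       # input projections B_t
    C = rng.normal(size=(T, N))       # output projections C_t
    x = rng.normal(size=T)            # scalar input sequence

    # Recurrent (SSM) view: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t
    h = np.zeros(N)
    y_ssm = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y_ssm[t] = C[t] @ h

    # "Causal linear attention" view: y_t = sum_{s<=t} (a_{s+1}*...*a_t) (C_t . B_s) x_s
    L = np.zeros((T, T))              # lower-triangular matrix of compounded decays
    for t in range(T):
        for s in range(t + 1):
            L[t, s] = np.prod(a[s + 1:t + 1])
    y_attn = (L * (C @ B.T)) @ x

    assert np.allclose(y_ssm, y_attn)  # both views give identical outputs

Here the lower-triangular matrix L of compounded decays plays the role of the causal attention mask.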
evnc 12 months ago
I'm a bit of a noob here, but if

a) a linear SSM (a form of RNN?) is equivalent to Attention without the scaling and softmax; and

b) Attention is "all you need" and the thing that made Transformers radically outperform all the previous architectures like LSTMs that used to dominate NLP;

does that imply c) that the scaling and softmax parts of the attention equation, in particular, are the magic touch that makes Transformers work so well?
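
For concreteness, a toy sketch (my own, not from the paper) of what dropping the scaling and softmax looks like; all names and shapes here are illustrative:

    import numpy as np

    T, d = 5, 8                                 # sequence length, head dimension (illustrative)
    rng = np.random.default_rng(1)
    Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
    mask = np.tril(np.ones((T, T)))             # causal mask

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    # Standard attention: softmax(QK^T / sqrt(d)) with causal masking
    scores = np.where(mask == 1, (Q @ K.T) / np.sqrt(d), -np.inf)
    out_softmax = softmax(scores) @ V

    # "Linear" attention: no scaling, no softmax -- just a masked QK^T.
    # This is the form that can be rewritten as a recurrence (an RNN/SSM).
    out_linear = (mask * (Q @ K.T)) @ V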
pama 12 months ago
Has anyone tried training it yet, and are there any obvious pitfalls for multi-GPU training like there were in Mamba-1?
adt 12 months ago
https://lifearchitect.ai/models-table/
imjonse 12 months ago
"From one perspective, Mamba-2 isn't strictly better than Mamba-1: while it's a dramatic improvement from a training perspective, Mamba-1 might be better from a pure inference perspective. Since inference speed of SSMs is entirely governed by the state dimension, if one wants to maximize performance for a target inference efficiency (i.e. for a particular state size N), then the increased expressivity of Mamba-1 might be better."
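
A rough sketch of why inference speed is governed by the state dimension N (my own simplification, not from the post): generation only carries a fixed-size state, updated once per token, so the per-token cost does not depend on how much context has been seen:

    import numpy as np

    def ssm_decode_step(h, x_t, a_t, B_t, C_t):
        """One decoding step: O(N) time and memory per token, regardless of
        context length -- unlike attention, whose KV cache grows with it."""
        h = a_t * h + B_t * x_t    # update the N-dimensional state
        y_t = C_t @ h              # read the output from the state
        return h, y_t

    N = 64                          # state dimension (illustrative)
    h = np.zeros(N)                 # the only thing carried between tokens
    rng = np.random.default_rng(2)
    for _ in range(10):             # generate 10 tokens
        a_t, B_t, C_t = rng.uniform(0.9, 1.0), rng.normal(size=N), rng.normal(size=N)
        h, y_t = ssm_decode_step(h, rng.normal(), a_t, B_t, C_t)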
tomrod 12 months ago
This appears to be huge. Win-win-win for fast LU factorization!
eranation 12 months ago
I’ll bite: can anyone please eli5 to the non PhDs among us?
sroussey 12 months ago
TLDR for non-NLP people: Mamba-2 is *much* faster to train than Mamba-1.