Late to the party, but I think my summary is (L is context length, C is hidden dimension, H is head size, C = H * nh):

3.1 Optimised attention: Instead of using a learned W_V to project from C to H, slice V into H-sized vectors (V is just the input tokens X). The matmul projects down to a lower dimension anyway, so why not just slice. Slicing is just reshaping (L, C) -> (L, nh, H).

3.2 Efficient attention: I think this opens with a typo: "In the last section, we discussed how and why we can remove W_O..." should be W_V, not W_O. Anyway, same as above, just for the keys this time. Reshape K (which is again just X) from (L, C) -> (L, nh, H).

3.3 Super attention: Introduce an (L, L) matrix W_A (lower triangular for masked attention) that transforms V (X again) on the left, (L, C) -> (L, C), whereas standard attention has a (C, C) W_V that transforms (L, C) -> (L, C) from the right. W_A is shared between heads. It's only more efficient when C > L (W_A is L x L versus C x C for W_V), so for long-context models it's probably not a win.

I think the first two modifications are equivalent to just setting W_V and W_K to constant identity matrices, right? That makes me wonder what would happen if you instead restricted W_V (and/or W_K, W_Q) to be block diagonal, so that each head in effect has an (H, H) matrix transforming only the slice of X it receives (instead of the full, non-square (C, H) projection each head normally gets). That's different from standard attention, right? Because there each head's W_V acts over the full C dimension. Almost surely someone has thought of this, so I'll try to find out (rough sketches at the end of this comment).

Still learning, so all this could be wrong.
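
To convince myself of the "slice instead of project" point in 3.1/3.2, here's a toy numpy check (shapes and names like nh, H are mine, not the article's code): slicing X into heads gives exactly what a standard value projection with W_V = I would give.

    import numpy as np

    L, nh, H = 6, 4, 8          # context length, number of heads, head size
    C = nh * H                  # hidden dimension

    X = np.random.randn(L, C)   # token representations (V and K are just X here)

    # 3.1 / 3.2: slice V (= X) into heads by reshaping (L, C) -> (L, nh, H)
    V_sliced = X.reshape(L, nh, H)

    # Standard-attention style: project with W_V, then split into heads.
    # With W_V = identity the projection is a no-op, so the two must agree.
    W_V = np.eye(C)
    V_proj = (X @ W_V).reshape(L, nh, H)

    assert np.allclose(V_sliced, V_proj)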
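
And a shape-only sketch of 3.3, again with made-up toy sizes: W_A multiplies from the left and mixes token positions, W_V from the right and mixes channels, so the parameter comparison is just L*L versus C*C.

    import numpy as np

    L, C = 6, 32
    X = np.random.randn(L, C)

    W_A = np.tril(np.random.randn(L, L))   # lower triangular for the masked/causal case
    W_V = np.random.randn(C, C)

    left = W_A @ X      # (L, L) @ (L, C) -> (L, C), mixes across token positions
    right = X @ W_V     # (L, C) @ (C, C) -> (L, C), mixes across channels

    print(W_A.size, W_V.size)   # L*L vs C*C parameters: W_A only wins when L < C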
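
Finally, the block-diagonal idea I was asking about, as a sketch: giving each head its own (H, H) matrix on its slice is the same as one (C, C) block-diagonal W_V, which is less expressive than standard attention's dense per-head (C, H) projections.

    import numpy as np

    L, nh, H = 6, 4, 8
    C = nh * H

    X = np.random.randn(L, C)
    blocks = [np.random.randn(H, H) for _ in range(nh)]   # one (H, H) matrix per head

    # Per-head view: each block transforms only its own slice of X
    X_heads = X.reshape(L, nh, H)
    per_head = np.stack([X_heads[:, h] @ blocks[h] for h in range(nh)], axis=1)  # (L, nh, H)

    # Equivalent single (C, C) block-diagonal W_V
    W_V = np.zeros((C, C))
    for h, B in enumerate(blocks):
        W_V[h * H:(h + 1) * H, h * H:(h + 1) * H] = B
    full = (X @ W_V).reshape(L, nh, H)

    assert np.allclose(per_head, full)
    # Standard multi-head attention instead gives each head a dense (C, H) slice
    # of W_V, so every head can mix information from the full C dimension.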