Differential Transformer

562 points by weirdcat, 7 months ago

31 comments

Imnimo, 7 months ago
I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to approach assigning zero attention to irrelevant stuff. And I get that having this subtraction formula makes it possible to assign exactly (or near) zero attention weight without having crazy outlier activations. But it seems like it also makes it very easy to have negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep all the stuff you don't care about so close to zero.

But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this.
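To make that balancing act concrete, here is a minimal numpy sketch (toy scores and λ values chosen by hand, not taken from the paper): a plain softmax keeps every weight strictly positive, while the difference of two softmaxes can land the irrelevant entries near zero for a well-matched λ, and pushes them negative once λ overshoots.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Toy scores over 4 tokens; token 0 is the "relevant" one.
    s1 = np.array([3.0, 0.0, 0.0, 0.0])    # first attention map
    s2 = np.array([-9.0, 0.0, 0.0, 0.0])   # second map: mass mostly on the irrelevant tokens

    a1, a2 = softmax(s1), softmax(s2)
    print(a1.round(3))           # plain softmax: irrelevant tokens still get ~4% each

    for lam in (0.10, 0.13, 0.30):
        d = a1 - lam * a2
        print(lam, d.round(3))   # lam ~0.13 drives the irrelevant weights to ~0; 0.30 makes them negative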
aDyslecticCrow, 7 months ago
Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adapted easily by others. Bravo!

I'm a little concerned about the last sentence of the section introduction of "2 Differential Transformer". It mentions using improvements from previous papers, but in the grammatical context, it's unclear if this improvement is added to both the normal transformer and their diff transformer. This would otherwise sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me.

Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful about some published research in this field.
msoad, 7 months ago
Like most things in this new world of Machine Learning, I'm really confused why this works.

The analogy to noise-cancelling headphones is helpful, but in that case we clearly know which is signal and which is noise. Here, if we knew, why would we even bother to do the noise-cancelling work?
islewis, 7 months ago
> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture trades twice as much attention memory in exchange for either a higher-quality model, or fewer parameters at a similar quality.

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only 60% of the parameters negate the double space for attention, leaving a memory profile similar to a traditional transformer?

- Does that tradeoff change noticeably between training and inference?
WithinReason, 7 months ago
> We empirically find that the setting λᵢₙᵢₜ = 0.8 − 0.6 × exp(−0.3 · (l − 1)) works well in practice

I wonder about the story behind that formula...
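For a feel of what that schedule does, here is a tiny sketch evaluating the quoted expression at a few layer indices (the depths shown are arbitrary examples, not from the paper):

    import math

    # lambda_init = 0.8 - 0.6 * exp(-0.3 * (l - 1)), as quoted above
    for l in (1, 2, 4, 8, 16, 28):
        lam = 0.8 - 0.6 * math.exp(-0.3 * (l - 1))
        print(f"layer {l:2d}: lambda_init = {lam:.3f}")
    # starts at 0.2 in the first layer (subtract only a little) and approaches 0.8 in deep layers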
iandanforth, 7 months ago
The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing: because their attention masks are subtracted from one another, if they both output similar values the attention across the board will drop to zero, and this will lead to high loss. So the only way to reduce loss is if they learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other to focus on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dry, but it is a useful simplification for understanding, IMO).
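A quick numpy illustration of that collapse, setting λ = 1 purely for the sake of the sketch (the paper learns λ, so this is a simplification): if the two maps are identical, the differential output is exactly zero and carries no information; once they diverge, signal gets through.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    V = rng.normal(size=(4, 8))              # 4 tokens, value dim 8
    a1 = softmax(rng.normal(size=(4, 4)))    # attention map learned by group 1

    out_same = (a1 - 1.0 * a1) @ V           # both groups learned the same map
    print(np.abs(out_same).max())            # 0.0 -- nothing passes through, so loss stays high

    a2 = softmax(rng.normal(size=(4, 4)))    # group 2 learned something different
    out_diff = (a1 - 1.0 * a2) @ V
    print(np.abs(out_diff).max())            # non-zero: the layer now carries signal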
patcon, 7 months ago
I wonder what is lost here. Surely there's a trade-off...

I'm wondering if there's any effect on "creativity", or the ability to interpolate between concepts. Hallucination and creativity feel very related to me. I understand hallucinating as simply being misaligned with the space humans feel it is appropriate to interpolate between.
chessgecko, 7 months ago
I wonder how much of the value here is from canceling out the positional noise RoPE produces. I would love to see a table comparing an ALiBi version of this to an ALiBi baseline, in addition to the RoPE models here.

Crazy gains though, congrats to the researchers.
vsroy, 7 months ago
Is the thing that's going on here that softmax can't push a value to 0, but by subtracting 2 softmax maps we can output 0s?
machinelearning, 7 months ago
This is a good problem to solve, but the approach is wrong imo.

It has to be done in a hierarchical way to know what you attended to + full context.

If the differential vector is being computed with the same input as the attention vector, how do you know how to modify the attention vector correctly?
pxdm, 7 months ago
What's the comparison with conventional attention using a more aggressive (lower temperature) softmax? I can imagine that for the multi-needle retrieval test this may also give a performance boost, although at some cost to other, more creative tasks.
nmacias, 7 months ago
AdderaLLM was *right there*
miven, 7 months ago
Is there an intuitive reason why this ends up working this well compared to, say, applying some kind of thresholding to attention activations that are below average for a given head to filter that same attention noise out?
pizza, 7 months ago
Was just going to mention that it seems that it should be possible to make a Flash Attention version of this algorithm and was pleasantly surprised to see they already included an implementation of one :)
watsonmusic, 7 months ago
The modification is simple and beautiful. And the improvements are quite significant.
singularity2001, 7 months ago
Anyone remember siamese networks?
slashdave, 7 months ago
I don't get it. Arbitrary linear combinations are already accommodated via feed forward. What am I missing?
WithinReason, 7 months ago
Hmmm, this could be expressed as 2 consecutive attentions in a residual branch.

Simplified differential T. looks like: (softmax(Q₁K₁) − λ softmax(Q₂K₂)) V

You can factor this into:

    x = softmax(Q₁K₁)V
    x += -λ softmax(Q₂K₂)V

which is like 2 subsequent regular attentions added that are sharing V.
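A quick numerical check of that factoring, as a sketch with random matrices (shapes picked arbitrarily): the one-shot differential form and the two added attentions sharing V give the same result.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(1)
    n, d = 5, 16
    Q1, K1, Q2, K2, V = (rng.normal(size=(n, d)) for _ in range(5))
    lam = 0.4

    # one-shot differential attention
    combined = (softmax(Q1 @ K1.T) - lam * softmax(Q2 @ K2.T)) @ V

    # two regular attentions on a residual branch, sharing V
    x = softmax(Q1 @ K1.T) @ V
    x = x + (-lam) * softmax(Q2 @ K2.T) @ V

    print(np.allclose(combined, x))   # True -- the two formulations are algebraically identical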
h_tbob, 7 months ago
I wish they didn't use SwiGLU and pre-RMSNorm so we could have a better comparison.

Then we would know how much this transformer innovation helps by itself.
digdugdirk, 7 months ago
Is there any way to replicate this with existing models, or are we going to need to wait for models to be trained in this style?

I'm imagining a smaller model examining the output tokens of a larger model and metaphorically slapping it on the wrist with a ruler if the output tokens start drifting off topic. Not quite the same, but an entertaining thought nonetheless.
dartos, 7 months ago
> By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization

I'm very interested in this claim. I was under the impression that hallucination is unavoidable in these kinds of models. IIRC a proof of that was trending on HN a couple of weeks ago.
mik09, 7 months ago
The r/MachineLearning comment thread has some interesting ideas, one of them linking this one with similar work in CV: https://www.reddit.com/r/MachineLearning/comments/1g0lnij/r_ngpt_normalized_transformer_with_representation/
lucidrains, 7 months ago
does this not mean we should explore usage of talking heads (Shazeer et al.) a bit more? https://arxiv.org/abs/2003.02436
x49asvk, 7 months ago
This concept is really interesting to me. I am very, very new to transformers but would love to learn more about normal transformers and differential ones too. Can anyone suggest any resources?
pikseladam, 7 months ago
Did this mean they solved the hallucination problem of transformers?

edit: not fully, but it gives promising results. Quite an improvement, actually.
nowayno583, 7 months ago
Does anyone understand why they are taking the difference between transformers instead of the sum? It seems to me that in a noise-reducing solution we would be more interested in the sum, as random noise would cancel out and the signal would be constructive.

Of course, even if I'm right, proper training would account for that by inverting signs where appropriate. Still, it seems weird to present it as the difference, especially seeing as they compare this directly to noise-cancelling headphones, where we sum both microphones' inputs.
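One way to see why it has to be a difference: softmax maps are non-negative, so adding them can only pile more weight onto a token, whereas only subtraction lets the second map cancel mass the first one put on irrelevant context (unlike audio, where the summed waveforms are signed and can cancel on their own). A tiny sketch with made-up scores:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    a1 = softmax(np.array([3.0, 0.0, 0.0, 0.0]))   # some weight leaks onto irrelevant tokens
    a2 = softmax(np.array([-9.0, 0.0, 0.0, 0.0]))  # second map concentrated on those tokens
    lam = 0.13

    print((a1 + lam * a2).round(3))   # sum: the irrelevant weights only grow
    print((a1 - lam * a2).round(3))   # difference: the irrelevant weights are cancelled to ~0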
badsandwitch, 7 months ago
What is the purpose of the lambda parameter? Why isn't it a constant of 1?
esafak, 7 months ago
How is this different than using a sparsity-inducing prior?
magicalhippo, 7 months ago
> The visualization reveals that Transformer tends to allocate only a small proportion of attention scores to the correct answer, while disproportionately focusing on irrelevant context.

> [...] Specifically, we partition the query and key vectors into two groups and compute two separate softmax attention maps. Then the result of subtracting these two maps is regarded as attention scores.

> [...] The approach is analogous to noise-canceling headphones and differential amplifiers in electrical engineering, where the difference between two signals cancels out common-mode noise.

Simple change, with seemingly decent improvements across the board.
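Putting the quoted description into code, here is a rough single-head sketch (my own simplification: it only shows the Q/K split and the subtraction, and leaves out the learned λ re-parameterization, normalization, and multi-head machinery the paper describes; all dimensions are arbitrary):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def diff_attention(X, Wq, Wk, Wv, lam):
        # X: (n, d_model); Wq/Wk project to 2*d_head so Q and K can be split into two groups
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_head = Q.shape[-1] // 2
        Q1, Q2 = Q[:, :d_head], Q[:, d_head:]
        K1, K2 = K[:, :d_head], K[:, d_head:]
        scale = 1.0 / np.sqrt(d_head)
        A1 = softmax(Q1 @ K1.T * scale)      # first softmax attention map
        A2 = softmax(Q2 @ K2.T * scale)      # second softmax attention map
        return (A1 - lam * A2) @ V           # their difference acts as the attention scores

    rng = np.random.default_rng(0)
    n, d_model, d_head = 6, 32, 8
    X = rng.normal(size=(n, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, 2 * d_head)) * 0.1 for _ in range(3))
    print(diff_attention(X, Wq, Wk, Wv, lam=0.5).shape)   # (6, 16)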
campers, 7 months ago
The tl;dr on high-level performance improvements:

"The scaling curves indicate that Diff Transformer requires only about 65% of model size or training tokens needed by Transformer to achieve comparable language modeling performance."

"Diff Transformer retains high performance even at reduced bit-widths, ranging from 16 bits to 6 bits. In comparison, Transformer's accuracy significantly drops with 6-bit quantization. The 4-bit Diff Transformer achieves comparable accuracy as the 6-bit Transformer, and outperforms the 4-bit Transformer by about 25% in accuracy."
ExxKA, 7 months ago
Very interesting. Currently working on timeseries with Transformers. Let me know if anyone else out there is also reading it from that context.