Efficient streaming language models with attention sinks

421 points by guywithabowtie over 1 year ago

23 comments

bluecoconut over 1 year ago
I think people are misreading this work and assuming it's equivalent to full dense attention. It's just saying there's an efficiency gain over sliding-window re-computation: instead of computing the L^2 cost over and over (T times), you can re-use a cache and maintain perplexity. I don't think they are claiming that this allows attending to content that was far away.

They tested by concatenating and measuring -> `Q A Q A Q A Q A...`, not by doing `Q Q Q Q A A A A...`

They also measure perplexity, showing that it produces "readable text" (coherent, locally viable), not that it is "extracting anything" from the big triangle gap of no-attention.

I think this would fail if given a book and asked to write the first word of every paragraph, or a one-sentence summary of each chapter. I might be wrong, because they didn't test tasks like this, but I'd be very, very surprised.
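To make the efficiency point concrete, here is a toy back-of-the-envelope sketch (mine, not from the paper) of how many attention scores each scheme computes while decoding T tokens with window length L: sliding-window re-computation re-encodes the whole window at every step, while a reused rolling KV cache only scores the new token against the cached entries.

```python
# Toy cost comparison (my sketch, not the authors' code): attention scores
# computed while decoding T tokens with a window of L under each scheme.

def recompute_cost(T: int, L: int) -> int:
    # Sliding window with re-computation: every step re-encodes up to L tokens,
    # so the attention inside that window is O(L^2) per step -> O(T * L^2) total.
    return sum(min(t, L) ** 2 for t in range(1, T + 1))

def streaming_cache_cost(T: int, L: int) -> int:
    # Rolling KV cache: each new token is scored once against at most L cached
    # entries, so the per-step cost is O(L) -> O(T * L) total.
    return sum(min(t, L) for t in range(1, T + 1))

print(f"{recompute_cost(10_000, 1_024):.2e}")        # ~9.8e+09
print(f"{streaming_cache_cost(10_000, 1_024):.2e}")  # ~9.7e+06
```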
cs702 over 1 year ago
On a first quick pass, this looks so good that I'm wondering if it's *too good to be true*!

But the work looks to be of decent quality, and the technique is remarkably straightforward: the idea is to apply attention over the first token and a sliding context window, ignoring everything in between, in each layer.

By implication, each layer must be gradually shifting relevant information forward in the sequence, enabling the top layer's final sliding attention window to see it.

The only caveat I can think of is that the sliding windows won't be able to shift all important information forward when their combined span isn't sufficient to cover the entire sequence -- for example, when model depth × window length < sequence length, if all windows have the same length.
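A rough sketch of the attention pattern being described (my reading of the comment, not the authors' code): each query position sees the first `n_sink` token(s) plus a causal sliding window, and nothing in between.

```python
# Sketch of the "first token + sliding window" attention mask described above.
# Illustrative only; not code from the paper or its repo.
import torch

def sink_plus_window_mask(seq_len: int, window: int, n_sink: int = 1) -> torch.Tensor:
    q = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    k = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    causal = k <= q                          # no attending to the future
    in_window = (q - k) < window             # recent tokens only
    is_sink = k < n_sink                     # the initial "sink" token(s)
    return causal & (in_window | is_sink)    # True = attention allowed

print(sink_plus_window_mask(seq_len=8, window=3, n_sink=1).int())
# Row 6 attends to keys {0, 4, 5, 6}: the sink plus its local window.
```

Since, apart from the sink, information can only move forward by at most `window` positions per layer, the depth × window caveat above falls out of this mask directly.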
huevosabio over 1 year ago
This seems to be largely enabled by the observation that softmax has to add up to one. From a quick glance [1], the model tends to use the first token as a placeholder for cases when you don't need to attend to any of the prior tokens.

The first time I read about this issue, that softmax is somewhat flawed, was in an HN post by Evan Miller [2], where he observes that forcing attention heads to allocate all attention to prior tokens is wrong, and we should allow them to "not attend" by adding one to the softmax denominator.

I love that they found a way to capitalize on this observation without having to retrain models. However, I wonder what the models would look like if they followed Evan's suggestion!

[1] Their description of attention sinks:

```
To understand the failure of window attention, we find an interesting phenomenon of autoregressive LLMs: a surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task, as visualized in Figure 2. We term these tokens "attention sinks". Despite their lack of semantic significance, they collect significant attention scores. We attribute the reason to the Softmax operation, which requires attention scores to sum up to one for all contextual tokens. Thus, even when the current query does not have a strong match in many previous tokens, the model still needs to allocate these unneeded attention values somewhere so it sums up to one. The reason behind initial tokens as sink tokens is intuitive: initial tokens are visible to almost all subsequent tokens because of the autoregressive language modeling nature, making them more readily trained to serve as attention sinks.
```

[2] https://news.ycombinator.com/item?id=36851494
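For reference, here is a hedged sketch of the "add one to the denominator" variant Evan Miller proposed, next to a standard softmax; this is my illustration of the idea, not code from either post.

```python
# Standard softmax vs. the "add one to the denominator" variant, for one row of
# attention scores. The extra 1 behaves like an always-available zero-logit
# "null" token, so a head can put most of its mass nowhere.
import torch

def softmax(scores: torch.Tensor) -> torch.Tensor:
    e = torch.exp(scores - scores.max())
    return e / e.sum()

def softmax_plus_one(scores: torch.Tensor) -> torch.Tensor:
    m = scores.max().clamp(min=0)             # stability shift, never below 0
    e = torch.exp(scores - m)
    return e / (e.sum() + torch.exp(-m))      # the "+1", rescaled by the shift

scores = torch.tensor([-4.0, -5.0, -3.0])     # no strong match anywhere
print(softmax(scores).sum())                  # 1.0: attention must go somewhere
print(softmax_plus_one(scores).sum())         # ~0.07: the head can effectively abstain
```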
smeeth over 1 year ago
Adding attention cache memory is an extremely interesting solution to this problem.

If anyone is curious, there was another paper [0] that came out a few days ago that made a related observation in Vision Transformers. Transformer models appear to pick tokens to store global information in - they need tokens to "think". You can eke out some performance improvements (and cool explanation images) by providing the model with specific tokens for this purpose.

[0] https://arxiv.org/pdf/2309.16588.pdf
Van_Chopiszt over 1 year ago
The authors just uploaded an FAQ section, which may clarify some of the confusion: https://github.com/mit-han-lab/streaming-llm/blob/main/README.md#faq
foota over 1 year ago
I could be wrong, but I'm not sure this is about what people seem to think it is, e.g., letting LLMs reference content past the trained length.

I think it may just be about the performance of the model with longer texts (on the things still within the context window?). It sounds like they're arguing that the model is essentially learning to stick some baggage in the attention to the initial tokens of the text, and breaks when that isn't within the window anymore, for reasons I'm not sure I understand (after all, isn't text in the middle just as good as text at the start for non-instruction inputs?).
doctoboggan over 1 year ago
How do any of these sliding window techniques handle instructions that are not expected and only show up at the end? For example, imagine feeding a book to the model and the last sentence being the instruction “return the count of the letter m in the previous input”. A human would handle this by first letting out an exasperated sigh, but then restarting the reading while counting. An LLM has no ability to loop back and re-read the input. (Ignore LLM issues with character counting for this example.) It seems like to solve this problem for real, the LLM needs to be able to loop and jump arbitrarily, but I’m sure that would introduce a whole new host of issues and possibly require a new architecture altogether.
iandanforth over 1 year ago
My somewhat facetious take is that LLMs are trying really hard to reinvent RNNs, and would do so if we just gave them the tools.
__rito__ over 1 year ago
Relevant: the eponymous Professor Han at MIT is teaching a TinyML course that is open to the public. See:

- https://news.ycombinator.com/item?id=37620507

- https://efficientml.ai
refulgentis over 1 year ago
This looks fantastic. It also answers the relevance of the "off-by-one" softmax.*

My naive question is... does it work? But that sounds dismissive. At length:

It shows that the baseline model can't respond after a certain length, versus a proposed model that does continue to respond.

But can a model that continues to respond retrieve information far "in the past"?

The demo video is too low-level, at least to my brain. It shows one model stops responding but the proposed one continues.

I spent about 5 minutes going frame by frame to see if the proposed model attempts to "recall" information from further back, but it looks like no.

Perfection here isn't necessary or even possible AFAIK, i.e. I don't expect it to recall page 1 with 100% accuracy at page 1000. But can it recall _anything_ from it, even if it ignores it?

The great thing about this era and work is we can check. But I hope someone has it up in a HuggingFace space before I figure out how to run it myself. :P

I'm leaning no, based on the sliding-window thing. It sounds like there are 4 fixed tokens, then the last context size - 4 tokens, and that's it.

* At the time, there were two camps: one, it's some random person saying it and there's prior art on implementations that do the off-by-one. Two, you'd be surprised how much little things go unnoticed by large groups, and do matter.
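That reading matches the eviction policy described in the thread: keep the key/value entries for the first few tokens plus the most recent ones, and drop the middle. A minimal sketch of that policy for a single head follows; the function name and tensor layout are my assumptions, not the streaming-llm repo's actual interface.

```python
# Sketch of "keep 4 sink tokens + the most recent (cache_size - 4) tokens".
# Illustrative only; names and shapes are assumptions, not the repo's API.
import torch

def evict_middle(keys: torch.Tensor, values: torch.Tensor,
                 cache_size: int = 2048, n_sink: int = 4):
    """keys / values: (seq_len, head_dim) cached entries for one head, one layer."""
    seq_len = keys.shape[0]
    if seq_len <= cache_size:
        return keys, values
    keep_recent = cache_size - n_sink
    keys = torch.cat([keys[:n_sink], keys[-keep_recent:]])
    values = torch.cat([values[:n_sink], values[-keep_recent:]])
    return keys, values
```

Anything evicted from the middle is gone as far as attention is concerned, which is why recalling page 1 at page 1000 isn't expected.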
WhatsName over 1 year ago
So can I let llama2 summarize books now, or are there any non-obvious caveats to this approach?
dheera over 1 year ago
I feel like information theory prevents full information retention for unlimited context lengths and finite compute, but I don't know if we are at information-theoretic limits to invoke this argument. Or rather, I don't know how to make a good analysis of (bits of context information) per (bits of model parameters).
ilovefood over 1 year ago
This is working relatively well, and the code is really worth a read. If you run it locally, consider the open PR and install sentencepiece as well. It's been generating text for the past 10 minutes now :D

Some of the instructions are ignored though, so I'd be careful there; one instruction is to rewrite the previous response by "starting every sentence with the letter A", which is a bit hit or miss right now.
Filligree over 1 year ago
Okay, what's the downside this time?
13years over 1 year ago
So can it now understand and write complete applications?
torginus over 1 year ago
Is it just me, or does every approach basically boil down to not wanting to pay the full quadratic cost over the context (usually by selecting which tokens to pay attention to, or by using some computationally cheaper substitute for each token)?

I feel like all these approaches are kind of equivalent to a fully dense attention matrix over a smaller context, but with careful curation of what goes into the context -- also known to us humans as summarizing each bit of text, or (perhaps less efficiently) going through a textbook with a highlighter.

My intuition is that the winning approach will be a small(ish) context, let's say 8k, with an efficient summarization and dynamic information retrieval scheme.
choeger over 1 year ago
Did anyone ever attempt a recursive architecture?

So you take the first window or logical separation (chapter, paragraph) and let the model summarize it into one or two sentences. Then you repeat that with the next window (and that derived sentence as context) and create a new logical separation out of a fixed number of sentences. Rinse and repeat until the result fits into your window.

I have a hunch that this is somewhat how the brain works when reading.
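One way to picture that scheme (my sketch; `summarize(text, context)` is a hypothetical stand-in for an LLM call, not an existing API):

```python
# Recursive summarization sketch: summarize chunks left to right, carrying the
# previous summary as context, then recurse on the summaries until one window
# is enough. `summarize` is a hypothetical LLM call supplied by the caller.
from typing import Callable, List

def recursive_summary(chunks: List[str],
                      summarize: Callable[[str, str], str],
                      group_size: int = 8) -> str:
    summaries, context = [], ""
    for chunk in chunks:
        context = summarize(chunk, context)   # 1-2 sentences, conditioned on the prior summary
        summaries.append(context)
    if len(summaries) <= 1:
        return " ".join(summaries)
    # Form new "chapters" out of a fixed number of summaries and repeat.
    grouped = [" ".join(summaries[i:i + group_size])
               for i in range(0, len(summaries), group_size)]
    return recursive_summary(grouped, summarize, group_size)
```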
idiotsecant over 1 year ago
This is a big claim, curious to see what the caveats are.
Trapais over 1 year ago
Looks like Longformer to me. They just renamed "global attention" to "attention sink" and removed the silly parts (dilated attention) and the BERT parts ([CLS] saw all N tokens; there is no need for BOS to see all tokens).
regularfry over 1 year ago
Anyone got a gut feel as to whether you could use this to transform Whisper into a better streaming model? It's a bit of a hack using it that way at the moment.
heavyarms over 1 year ago
Having only read the abstract, I'm probably way off the mark here, but my first thought was: LLM + LSTM.
guywithabowtie over 1 year ago
We introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
kridsdale3 over 1 year ago
What if my "Favorite LLM" is GPT4? I don't want to use Llama or anything like that. Does this GitHub code let me use the OpenAI API and run the new memory technique on top of that?