
Unlimiformer: Long-Range Transformers with Unlimited Length Input

335 points by shishy about 2 years ago

18 comments

mxwsn about 2 years ago
1. This is not exact attention, but an approximation of it. Specifically, they use k-nearest neighbors to retrieve the top-k most similar tokens, out of an "unlimited-length input" say of size N, where k << N.

2. This idea is quite similar to retrieval transformers and Hopfield networks, which have been known and published for several years now. It's not really that novel.

3. Due to the preceding points, the title can easily mislead people. It's not really a conventional transformer, and it's not a breakthrough.

4. This paper is a preprint and not peer-reviewed.

"I generally don't enjoy seeing preprints like this going to the top of Hacker News. This would be a higher quality submission if the paper was peer-reviewed or put into a greater context, like a blog post discussion or something like that."

Let me retract this and say something a bit nicer :) I personally think this specific preprint making it to the top of HN is potentially harmful, because of the hype around LLMs, the diverse audience of readers here, and the specific title that implies a claim of a "transformer with unlimited context length", when this is misleading. I don't have anything against preprints in general - a lot of work outside of the peer-review process ends up being very impactful.
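Below is a minimal sketch (mine, not the paper's code) of the approximation point 1 describes: exact attention scores a query against all N keys, while the kNN shortcut keeps only the top-k highest-scoring ones. The sizes, the NumPy setup, and the argsort standing in for a real ANN index are all illustrative assumptions.

```python
import numpy as np

def full_attention(q, keys, values):
    # Exact attention: score the query against every one of the N keys.
    scores = keys @ q                        # shape (N,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values                        # shape (d,)

def topk_attention(q, keys, values, k=32):
    # Approximate attention: keep only the k highest-scoring keys (k << N).
    # A real system would use an ANN index (e.g. FAISS) instead of argsort.
    scores = keys @ q
    idx = np.argsort(scores)[-k:]
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return w @ values[idx]

rng = np.random.default_rng(0)
N, d = 100_000, 64
keys, values = rng.normal(size=(N, d)), rng.normal(size=(N, d))
q = rng.normal(size=d)
exact = full_attention(q, keys, values)
approx = topk_attention(q, keys, values)
print(np.max(np.abs(exact - approx)))  # nonzero: an approximation, not exact attention
```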
GistNoesis about 2 years ago
I've read the paper quickly. The main idea is simple and interesting, but maybe a little dubious (it's kind of an accuracy-for-memory trade-off).

In the transformer architecture one has to compute QK^T:

QK^T = (h_d W_q W_k^T) h_e^T (equation (2), page 3 in the paper)

where h_d is the hidden state of the decoder, h_e is the hidden state of the encoder, W_q and W_k are parameter matrices, and ^T denotes transposition.

By grouping the calculation this way, in a transformer encoder-decoder architecture, they can build and use only a single index (you index the h_e vectors using a vector database) for all the decoder layers' queries, instead of having to build 2 * L * H indices (with L the number of layers of the decoder and H the number of heads in the decoder).

But what makes it a little dubious is that this transformation means you make your nearest-neighbor queries in a space of dimension "dimension of the hidden state", instead of "dimension of a head", which is H times smaller.

So if you had to build 2 * L * H indices, each index would be H times smaller.

So you only gain a factor of 2 * L. The trade-off is that you are doing a nearest-neighbor search in higher dimension, where you are then subject to the curse of dimensionality (the higher the dimension, the more similar all points are to each other), whereas the whole point of projections in transformers is to lower the dimension so that the knn search makes more sense. So to get the same accuracy, your nearest-neighbor search engine will have to work a lot harder.

Also, as an approximation of the transformer that relies on a knn search, it comes with the problems associated with that (for example, it is harder to train because it is more sparse, and it has a tendency to hyperfocus), but it can be complemented with a low-rank linearization of the attention so the neural net also acts on the gist rather than only on the closest neighbors.
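A small numerical check (my notation and shapes, not the paper's code) of the regrouping described above: folding W_q and W_k into the query side gives the same scores as the per-head formulation, so a single index over the raw h_e vectors can serve every head and layer, at the cost of searching in d_model dimensions rather than d_head.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, N = 512, 64, 1000
Wq = rng.normal(size=(d_head, d_model))   # query projection of one head
Wk = rng.normal(size=(d_head, d_model))   # key projection of the same head
h_e = rng.normal(size=(N, d_model))       # encoder hidden states (indexed once)
h_d = rng.normal(size=d_model)            # one decoder hidden state

# Per-head formulation: project both sides into d_head dimensions, which would
# require indexing W_k h_e separately for every head and layer.
scores_per_head = (h_e @ Wk.T) @ (Wq @ h_d)

# Regrouped formulation: fold both projections into the query side, so the
# search runs against a single shared index over the raw h_e vectors.
q_tilde = Wk.T @ (Wq @ h_d)               # lives in d_model dimensions
scores_regrouped = h_e @ q_tilde

print(np.allclose(scores_per_head, scores_regrouped))  # True: same scores, one shared index
```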
space_fountain about 2 years ago
As I understand it, the approach here is to use an approximate nearest neighbor database to retrieve highly relevant tokens from across large documents using the existing attention heads. So each attention head retrieves context from the entire document. They say this can work without fine tuning, but performance improves with it. This is apparently extending a piece of prior work, but they've managed to re-arrange the linear algebra of attention so they only need one database for all attention heads across all layers of the model. I'm a bit confused how attention would work here for layers below the top, and a bit confused about how position is encoded for tokens across a long document like this.
sva_ about 2 years ago
I think infiniformer would've sounded better. The benchmark scores seem pretty marginal.
smusamashah about 2 years ago
What does it mean for ChatGPT and the like? Can they employ this method to virtually get rid of the context-token limit?
chrgy about 2 years ago
In the age of transformers, let's ask a transformer to summarize this paper:

The Unlimiformer paper is about a new way to make computer programs that can summarize really long pieces of text. Normally, when you ask a computer program to summarize something, it can only handle a certain amount of text at once. But with Unlimiformer, the program can handle as much text as you want!

The way Unlimiformer works is by using a special technique called a "k-nearest-neighbor index" to help the program pay attention to the most important parts of the text. This makes it possible for the program to summarize even really long documents without losing important information.

Overall, Unlimiformer is an exciting new development in natural language processing that could make it easier for computers to understand and summarize large amounts of text.
TeMPOraL about 2 years ago
Is this how Kagi's "universal summarizer" works? They wrote a lot of copy about how it's able to summarize websites and documents of arbitrary length, while not revealing how on Earth this actually works. It *does* seem to work, though.
logophobia about 2 years ago
An alternative which I've used with some success is structured state-space models: https://srush.github.io/annotated-s4/. A very different approach that works well for quite a few types of problems.
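For readers who do not follow the link, here is a toy sketch of the recurrence those models build on (illustrative only; S4 adds a careful parameterization and a convolutional training view that this omits). The point is that the state has a fixed size no matter how long the input is.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    # Discrete linear state-space recurrence: x_t = A x_{t-1} + B u_t, y_t = C x_t.
    # Memory is the fixed-size state x, regardless of the length of u.
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
d = 8
A = 0.9 * np.eye(d)                      # stable toy dynamics
B, C = rng.normal(size=d), rng.normal(size=d)
print(ssm_scan(A, B, C, rng.normal(size=10_000)).shape)  # (10000,) outputs from an 8-dim state
```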
opportune about 2 years ago
This seems like a definite attention optimization, but I think the fundamental problem with attention is that it doesn't handle state in a way that scales well.

Personally I think the RNN/LSTM state-handling approach is going to be something we revisit when trying to advance past transformers. It handles state in a way that generalizes and scales better (it should in theory learn an attention-like mechanism anyway, and state is independent of input size).

It may be harder to train, and require further improvements, but it really seems more like an engineering or cost problem than a theoretical one. But I'm only an amateur and not an expert. Maybe continued improvement on attention will approach generalized state handling while training more efficiently than improved versions of the more general stateful approaches do.
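A toy illustration (my own, not from the paper or the comment) of the contrast drawn above: attention-style memory grows with the number of tokens processed, while a recurrent state stays the same size throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 10_000
W = rng.normal(size=(d, d)) / np.sqrt(d)

kv_cache = []              # attention-style memory: one entry per processed token
state = np.zeros(d)        # RNN/LSTM-style memory: fixed size forever
for _ in range(steps):
    x = rng.normal(size=d)
    kv_cache.append(x)                 # grows to O(steps) vectors
    state = np.tanh(W @ state + x)     # stays a single d-dimensional vector

print(len(kv_cache), state.shape)      # 10000 (64,)
```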
nephanth about 2 years ago
Btw, why do transformers have a limited input size in the first place? I'm pretty sure the self-attention mechanisms scale (although with bad complexity) to arbitrary sizes.
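As a rough back-of-the-envelope illustration of that "bad complexity" (my own numbers, not from the thread): the attention score matrix alone is N x N, so memory grows quadratically even though nothing in the formula caps N itself.

```python
def score_matrix_gigabytes(n_tokens, bytes_per_float=4):
    # Memory for the n x n attention score matrix alone (before softmax and values).
    return n_tokens * n_tokens * bytes_per_float / 1e9

for n in (1_000, 10_000, 100_000):
    print(n, f"{score_matrix_gigabytes(n):.3f} GB")
# 1000 0.004 GB, 10000 0.400 GB, 100000 40.000 GB
```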
ztratar about 2 years ago
Given that model performance thus depends on k-nearest-neighbor search, and those algorithms are proving not great even for baseline vector search, how well will this actually work?

It seems mostly like a vertically integrated vector DB + existing LLM call, but correct me if I'm wrong. There are of course some performance gains with that, but the holy grail of "understanding" at unlimited length still seems unsolved.
ftxbro about 2 years ago
Other times this was put on Hacker News:

https://news.ycombinator.com/item?id=35823039

https://news.ycombinator.com/item?id=35803470
XorNot about 2 years ago
Hang on, how unlimited is unlimited here? Surely the immediate thing you'd do with this is just *never* delete any prior inputs, so it becomes de facto long-term memory for the model?
intalentive about 2 years ago
The ML community keeps rediscovering the work of Steve Grossberg. This is very similar to his decades-old ART model.
szundi about 2 years ago
Input should be the Internet then.
jfisher4024 about 2 years ago
Neubig is the real deal. I’d take this paper seriously.
j2kun about 2 years ago
Appendix C of the PDF has an example book summary of The Brothers Karamazov, with noted minor factual errors:

https://arxiv.org/pdf/2305.01625.pdf

> Unlimiformer summary:

> The first part of the novel focuses on the question of whether or not the Russian nobleman, Dmitri Fyodorovitch, has killed his father. In the town of Ivanovna, the lieutenant-colonel of the Mushenkhanovitch is accused of the murder of his brother Ivanovitch. The lieutenant-incommand, Vasilyevitch, takes the form of a dog, and the two men–the two men and the woman who are questioned by the court-martial–murphy. The two men cry out to the God of Russia for help in their quest to save the town. The man, afraid of the wrath of the God, hands the dog a bunch of letters that are supposed to be proof of his love for his brother. The old man–the one who had killed his mother, and then found the letter–arrives. He reads it–asked the old man to forgive him for the murder and then takes the dog away. The other men, all of whom are prisoners, demand that the man confess his crime to the court. The first and most important thing they tell the court is that they love the man. The court acquits the man and sentences the man to death. The second man–an old officer of the town, Alekandrovitch–askes to tell them the same thing. The third man–in the process of confessing his crime–is Vashenka, a drunk man who has been sent to the town to kill his father, for reasons which are not entirely clear to the people. The woman’s servant, Evgenyevna, is also the one who has told the court the story of the Medvedevitch’s murder, for the good old man’s and the young man’s love. The three men, who are separated for the first time, are laughing at the man’s attempt to seduce Mitya. The young man, in the meantime, is conscripted into the town-side. He tells the court that he loves her, but he has yet to tell her the true story. The men, in this room, demand a man to kill her, and she will not betray them. The women, in their own country, are rebelling against the man who had sent them three thousand roubles, and they will not allow the man of the people to see them. They will not let the man in the town be allowed to see the man–or Dmitriovitch; he will have her husband killed him. He will not tell the people who love him. The next man, named Vashenovitch, arrives, and takes the man away. They all begin to laugh at the fact that he has succeeded in seducing and entrusting his brother Dmitri. He is then taken away to the old woman’s house, where the governor-side-of-the-world, and his sister, Arkadin, is being punished. The priestesses and the baron are shocked, for they have been so virtuous and well-suited. The only thing they will be able to do is kill the priest. They threaten to burn the priestess to death, for she has been so wicked and libidinous that she has not yet seen the priest, for her husband. The priests–ostensibly convinced that she is a woman who loves the priest and has been punished for her love and for allowing the priest to marry her. The last man, Yakivitch, arrives at the house, and, after a long day of drinking and then some of the men–is killed. He and the priest are ordered to leave the town so that the priest can finally be reunited with the people of the old lady. The final man, the commander of the St. Petersburg town of Arkadina, is sentenced to death for the crime of having killed and then the lieutenant of the governor, for taking the money. The commander, the former lieutenant-delegation of the People’s Army, is summarily executed, and all the men, except for the commander, have been summarily punished for their crime. The entire town is shocked and, in a very dramatic way, the priestesses plead for the forgiveness of the man, for allowing them to kill and imprison Ivan. They plead for their brother to be restored as well, for all the people they have loved, and for the priestor to tell the story
adamnemecek about 2 years ago
The attention mechanism corresponds to the Hopf algebraic convolution, a generalization of the commonly known convolution.

I'm in the process of implementing a framework based on this idea.

I recently wrote a paper on this: https://arxiv.org/abs/2302.01834

I have a Discord channel: https://discord.cofunctional.ai