Visualizing Attention, a Transformer's Heart [video]

999 points by rohitpaulk about 1 year ago

23 comments

Xcelerate about 1 year ago
As someone with a background in quantum chemistry and some types of machine learning (but not neural networks so much), it was a bit striking while watching this video to see the parallels between the transformer model and quantum mechanics.

In quantum mechanics, the state of your entire physical system is encoded as a very high dimensional normalized vector (i.e., a ray in a Hilbert space). The evolution of this vector through time is given by the time-translation operator for the system, which can loosely be thought of as a unitary matrix U (i.e., a probability preserving linear transformation) equal to exp(-iHt), where H is the Hamiltonian matrix of the system that captures its “energy dynamics”.

From the video, the author states that the prediction of the next token in the sequence is determined by computing the next context-aware embedding vector from the last context-aware embedding vector *alone*. Our prediction is therefore the result of a linear state function applied to a high dimensional vector. This seems a lot to me like we have produced a Hamiltonian of our overall system (generated offline via the training data), then we reparameterize our particular subsystem (the context window) to put it into an appropriate basis congruent with the Hamiltonian of the system, then we apply a one-step time translation, and finally transform the resulting vector back into its original basis.

IDK, when your background involves research in a certain field, every problem looks like a nail for that particular hammer. Does anyone else see parallels here, or is this a bit of a stretch?

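The step described there, producing the next-token prediction from the final context-aware vector alone, is just one linear map (the unembedding) followed by a softmax. A minimal numpy sketch with tiny illustrative sizes and hypothetical names (not taken from the video):

    import numpy as np

    # Tiny illustrative sizes; in the video's GPT-3 example these would be
    # roughly 12,288 embedding dimensions and ~50k vocabulary entries.
    d_embed, vocab_size = 64, 1000
    rng = np.random.default_rng(0)

    x_last = rng.standard_normal(d_embed)                    # final context-aware embedding
    W_unembed = rng.standard_normal((vocab_size, d_embed))   # hypothetical unembedding matrix

    logits = W_unembed @ x_last          # one linear map applied to a single vector
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the vocabulary
    next_token = int(probs.argmax())     # greedy choice of the next token
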
seydor about 1 year ago
I have found the YouTube videos by CodeEmporium to be simpler to follow: https://www.youtube.com/watch?v=Nw_PJdmydZY

Transformer is hard to describe with analogies, and TBF there is no good explanation why it works, so it may be better to just present the mechanism, "leaving the interpretation to the viewer". Also, it's simpler to describe dot products as vectors projecting on one another.

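On that last point, a tiny numeric illustration (a sketch of my own, not from the linked video): the dot product q·k equals the length of q's projection onto k, scaled by the length of k.

    import numpy as np

    q = np.array([2.0, 1.0])
    k = np.array([3.0, 0.0])

    dot = q @ k                            # 6.0
    proj_len = dot / np.linalg.norm(k)     # length of q's projection onto k: 2.0
    assert np.isclose(dot, proj_len * np.linalg.norm(k))
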
rayval about 1 year ago
Here's a compelling visualization of the functioning of an LLM when processing a simple request: https://bbycroft.net/llm

This complements the detailed description provided by 3blue1brown.

tylerneylon about 1 year ago
Awesome video. This helps to show how the Q*K matrix multiplication is a bottleneck, because if you have sequence (context window) length S, then you need to store an SxS size matrix (the result of all queries times all keys) in memory.

One great way to improve on this bottleneck is a new-ish idea called Ring Attention. This is a good article explaining it:

https://learnandburn.ai/p/how-to-build-a-10m-token-context

(I edited that article.)

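To make the quadratic cost concrete, here is a rough numpy sketch of that score computation with made-up sizes; the (S, S) array is the part that grows with the square of the context length.

    import numpy as np

    S, d_head = 4096, 128                # illustrative sequence length and head dimension
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((S, d_head))
    K = rng.standard_normal((S, d_head))

    scores = Q @ K.T / np.sqrt(d_head)   # shape (S, S): every query scored against every key
    print(scores.shape, scores.nbytes / 1e6, "MB")   # ~134 MB of float64 for a single head
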
promiseofbeans about 1 year ago
His previous post 'But what is a GPT?' is also really good: https://www.3blue1brown.com/lessons/gpt

YossarianFrPrez about 1 year ago
This video (with a slightly different title on YouTube) helped me realize that the attention mechanism isn't exactly a specific function so much as it is a meta-function. If I understand it correctly, Attention + learned weights effectively enables a Transformer to learn a semi-arbitrary function, one which involves a matching mechanism (i.e., the scaled dot-product).

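For anyone who wants that matching mechanism spelled out, scaled dot-product attention fits in a few lines; a generic sketch (not the video's own code):

    import numpy as np

    def attention(Q, K, V):
        # Each query is matched against every key; softmax turns the match scores
        # into weights that are used to mix the value vectors.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((8, 64)) for _ in range(3))
    out = attention(Q, K, V)    # shape (8, 64)
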
abotsis about 1 year ago
I think what made this so digestible for me were the animations. The timing, and how they expand/contract and unfold while he's speaking, is all very well done.

nostrebored about 1 year ago
I work in a closely related space, and this instantly became part of my team's onboarding docs.

Worth noting that a lot of the visualization code is available on GitHub:

https://github.com/3b1b/videos/tree/master/_2024/transformers

bilsbie about 1 year ago
I finally understand this! Why did every other video make it so confusing!

shahbazac about 1 year ago
Is there a reference which describes how the current architecture evolved? Perhaps from a very simple core idea to the famous "all you need" paper?

Otherwise it feels like lots of machinery created out of nowhere. Lots of calculations and very little intuition.

Jeremy Howard made a comment on Twitter that he had seen various versions of this idea come up again and again - implying that this was a natural idea. I would love to see examples of where else this has come up so I can build an intuitive understanding.

namelosw about 1 year ago
You might also want to check out the other 3b1b videos on neural networks, since there is a sort of progression from one video to the next: https://www.3blue1brown.com/topics/neural-networks

jiggawatts about 1 year ago
It always blows my mind that Grant Sanderson can explain complex topics in such a clear, understandable way.

I've seen several tutorials, visualisations, and blogs explaining Transformers, but I didn't fully understand them until this video.

mastazi about 1 year ago
That example with the "was" token at the end of a murder novel is genius (at 3:58 - 4:28 in the video) and really easy for a non-technical person to understand.

justanotherjoe about 1 year ago
It seems he brushes over the positional encoding, which for me was the most puzzling part of transformers. The way I understood it, positional encoding is much like dates. Just like dates, there are repeating minutes, hours, days, months, etc. Each of these values has a shorter 'wavelength' than the next. The values are then used to identify the position of each token. Like, 'oh, I'm seeing January 5th tokens. I'm January 4th. This means this is after me.' Of course the real positional encoding is much smoother and doesn't have an abrupt end like dates/times, but I think this was the original motivation for positional encodings.

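That "repeating wavelengths" picture maps directly onto the sinusoidal encoding from the original Transformer paper; a small sketch of it (my own illustration, not from the video):

    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Each pair of dimensions oscillates at its own wavelength, from fast
        # (like minutes) to slow (like months), so every position gets a unique code.
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angle = pos / (10000 ** (2 * i / d_model))
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angle)
        pe[:, 1::2] = np.cos(angle)
        return pe

    pe = positional_encoding(128, 64)   # shape (128, 64)
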
bjornsing about 1 year ago
This was the best explanation I’ve seen. I think it comes down to essentially two aspects: 1) he doesn’t try to hide complexity and 2) he explains what he thinks is the purpose of each computation. This really reduces the room for ambiguity that ruins so many other attempts to explain transformers.

stillsut about 1 year ago
In training we learn a) the embeddings and b) the KQ/MLP weights.

How well do Transformers perform given learned embeddings but only randomly initialized decoder weights? Do they produce a word soup of related concepts? Anything syntactically coherent?

Once a well-trained, high-dimensional representation of tokens is established, can they learn the KQ/MLP weights significantly faster?

rollinDyno about 1 year ago
Hold on, every predicted token is only a function of the previous token? I must have something wrong. This would mean that a whole novel is encapsulated within the embedding of "was", which is of length 12,228 in this example. Is it really possible that this space is so rich that a single point in it can encapsulate a whole novel?

kordlessagain about 1 year ago
What I'm now wondering about is how intuition to connect completely separate ideas works in humans. I will have very strong intuition something is true, but very little way to show it directly. Of course my feedback on that may be biased, but it does seem some people have "better" intuition than others.

thomasahle about 1 year ago
I like the way he uses a low-rank decomposition of the Value matrix instead of Value+Output matrices. Much more intuitive!

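Concretely, the idea is to write the value map as a "value down" projection into the small head space followed by a "value up" projection back into embedding space, so the full d×d matrix never has to be materialized. A rough numpy sketch with made-up sizes:

    import numpy as np

    d_embed, d_head = 12288, 128       # made-up sizes for illustration
    rng = np.random.default_rng(0)
    value_down = rng.standard_normal((d_head, d_embed)) * 0.02   # project into head space
    value_up = rng.standard_normal((d_embed, d_head)) * 0.02     # project back to embedding space

    x = rng.standard_normal(d_embed)   # one token's embedding
    v = value_up @ (value_down @ x)    # equivalent to one rank-128 value matrix acting on x
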
cs702 about 1 year ago
Fantastic work by Grant Sanderson, as usual.

*Attention has won*.[a]

It deserves to be more widely understood.

---

[a] Nothing has outperformed attention so far, not even Mamba: https://arxiv.org/abs/2402.01032

mehulashah about 1 year ago
This is one of the best explanations that I’ve seen on the topic. I wish there were more work, however, not on how Transformers work, but on why they work. We are still figuring it out, but I feel that the exploration is not at all systematic.

spacecadet about 1 year ago
Fun video. Much of my "art" lately has been dissecting models, injecting or altering attention, and creating animated visualizations of their inner workings. Some really fun shit.

kjhenner about 1 year ago
The first time I really dug into transformers (back in the BERT days) I was working on an MS thesis involving link prediction in a graph of citations among academic documents. So I had graphs on the brain.

I have a spatial intuition for transformers as a sort of analog to a message-passing network over a "leaky graph" in an embedding space. If each token is a node, its key vector sets the position of an outlet pipe that it spews value through to diffuse out into the embedding space, while the query vector sets the position of an input pipe that sucks up the value other tokens have pumped out into the same space. Then we repeat over multiple attention layers, meaning we have these higher-order semantic flows through the space.

Seems to make a lot of sense to me, but I don't think I've seen this analogy anywhere else. I'm curious if anybody else thinks of transformers in this way. (Or wants to explain how wrong/insane I am?)