The most clarifying post I've read on attention is from Cosma Shalizi[0], who points out that "attention" is quite literally a re-discovery/re-invention of kernel smoothing. It's probably less helpful if you don't come from a quantitative background, but if you do, it makes things shockingly clear.

Once you see this, "multi-headed attention" is just kernel smoothing with more kernels, plus a linear transformation of their combined results (in practice: averaging or adding them)!

0. http://bactra.org/notebooks/nn-attention-and-transformers.html#attention
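For anyone who wants the analogy spelled out in code, here is a minimal numpy sketch (my own illustration, not something from Shalizi's post): scaled dot-product attention is Nadaraya-Watson kernel smoothing with an exponential kernel over query/key similarities.

```python
import numpy as np

def kernel_smoother(queries, keys, values):
    # Nadaraya-Watson smoothing: for each query, return a weighted average of
    # the values, weighted by an exponential kernel on query/key similarity.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # normalising = softmax
    return weights @ values

# "Attention" is this smoother applied to learned linear projections of the
# token embeddings; each extra head is just another set of projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                         # 5 tokens, 16-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = kernel_smoother(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                     # (5, 16)
```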
This looks very interesting.
The easiest way to navigate to the start of this series of articles seems to be https://www.gilesthomas.com/til-deep-dives/page/2

Now if only I could find some time...
Must be my ignorance, but every time I see explainers for LLMs like this post, it's hard to believe that AGI is upon us. It just doesn't feel that "intelligent", but then again, that might just be my ignorance.
I'm reading through the book the blog mentions right now and building a small LLM. I'm only on chapter 2, but so far it's helped clarify a lot of things about LLMs and break them down into small steps. Highly recommend Build a Large Language Model (From Scratch).
I do wonder if it is in the book author's interest for people to blog and summarize the whole book's content?
Or even more interesting: Would it be fine if I let an LLM summarize a book and create such a series of blog posts?
If you are interested in this sort of thing, you might want to take a look at a very simple neural network with two attention heads that runs right in the browser in pure JavaScript; you can view source on this implementation:

https://taonexus.com/mini-transformer-in-js.html

Even after training for a hundred epochs it really doesn't work very well (you can test it in the Inference tab after training it), but it doesn't use any libraries, so you can see the math itself in action in the source code.
I think there are two layers of the 'why' in machine learning.

When you look at a model architecture, it is described as a series of operations that produces the result.

There is a lower-level why which, while being far from easy to show, describes why these algorithms produce the required result. You can show why it's a good idea to use cosine similarity, or why cross entropy was chosen to express the measurement. In Transformers you can show that the Q and K matrices transform the embeddings into spaces that allow different things to be closer, and that having control over that closeness allows you to make distinctions. This form of why is the explanation usually given in papers: it is possible to methodically show that you will get the benefits described from the techniques proposed.

The greater Why is much, much harder: harder to identify and harder to prove. The first why can tell you that something works, but it can't really tell you why it works in a way that can inform other techniques.

In the Transformer, the intuition is that the 'Why' is something along the lines of: Q transforms embeddings into an encoding of what information the embedding needs to resolve confusion, and K transforms embeddings into the information they have to impart. When there's a match between 'what I want to know about' and 'what I know about', the V can be used as 'the things I know' to accumulate the information where it needs to be.

It's easy to see why this is the hard form. Once you get into higher semantic descriptions of what is happening, it is much harder to prove that this is actually what is happening, or that it gives the benefits you think it might. Maybe Transformers don't work like that. Sometimes semantic relationships appear to be present in a process when there is an unobserved quirk of the mathematics that makes the result coincidentally the same.

In a way I think of the maths as picking up a many-dimensional object in each hand and magically rotating and (linearly) squishing them differently until they look aligned enough to see the relationship I'm looking for, and then poking those bits towards each other. I can't really think about that and the semantic "what things want to know about" at the same time, even though they are conceptualisations of the same operation.

The advantage of the lower why is that you can show that it works. The advantage of the upper why is that it can enable you to consider other mechanisms that might serve the same function: they may be mathematically different but achieve the same goal.

To take a much simpler example from computer graphics: there are many ways to draw a circle with simple loops over mathematically provable descriptions of a circle. The Bresenham circle-drawing algorithm comes with a why that shows it makes a circle, but the "why do it that way" was informed by a greater understanding of the task being performed.
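To make the lower-level 'why' for Q and K concrete, here is a tiny numpy sketch (my own illustration, with hand-picked matrices rather than learned ones): two embeddings that are orthogonal in the raw embedding space can still score highly against each other once Q and K project them into a shared subspace.

```python
import numpy as np

# Two token embeddings that are orthogonal in the raw embedding space.
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
print(a @ b)          # 0.0 -- "unrelated" as raw embeddings

# Hand-picked, purely illustrative projections that map both tokens onto the
# same direction: after projecting, the query for `a` matches the key for `b`.
W_q = np.array([[1.0, 0.0],
                [0.0, 0.0],
                [0.0, 0.0],
                [0.0, 0.0]])
W_k = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [0.0, 0.0],
                [0.0, 0.0]])
q_a = a @ W_q         # "what a wants to know about"
k_b = b @ W_k         # "what b knows about"
print(q_a @ k_b)      # 1.0 -- close in the projected space
```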
Regarding this statement about semantic space:

> so long as vectors are roughly the same length, the dot product is an indication of how similar they are.

This potential length difference is the reason cosine similarity is used instead of the dot product for concept comparisons. Cosine similarity is like a 'scale-independent dot product': it represents similarity of concept, independent of "signal strength".

However, if two vectors point in the same direction but one is 'longer' (higher magnitude) than the other, what that indicates "semantically" is that the longer vector is a "stronger signal" of the same concept. So if "happy" has a vector direction, then "very happy" should be a longer vector in the same direction.

It makes me wonder whether there's a way to impose a "corrective" force on the evolution of the model weights during training, so that words like "more" prefixed in front of a string are guaranteed to encode as a vector multiple of that string. I'm not sure how that would work with back-propagation, but applying certain common-sense knowledge about how the semantic space "must be" shaped could potentially be the next frontier of LLM development beyond transformers (and by transformers I really mean the specialization of the attention heads).
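As a small aside on the dot product vs. cosine similarity point, here is a quick numpy sketch with a made-up "happy" vector (purely illustrative values, not real embeddings): scaling a vector changes its dot products but not its cosine similarity.

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product after normalizing away vector length ("signal strength").
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

happy = np.array([2.0, 1.0, 0.0])
very_happy = 3.0 * happy                      # same direction, larger magnitude

print(happy @ very_happy)                     # 15.0 -- dot product grows with length
print(cosine_similarity(happy, very_happy))   # 1.0  -- direction is identical
```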
Off-topic rant: I hate blog posts that quote the author's earlier posts. They should just reiterate the point if it is important, or use a link if not. Otherwise it feels like they want to fill space without any extra work. The old posts are not that groundbreaking, I assure you. /rant