I think there are two layers of 'why' in machine learning.<p>When you look at a model architecture, it is described as a series of operations that produces the result.<p>There is a lower-level why which, while far from easy to show, describes why these algorithms produce the required result. You can show why it's a good idea to use cosine similarity, or why cross entropy was chosen to measure the error. In Transformers you can show that the Q and K matrices transform the embeddings into spaces that allow different things to be closer together, and that this control over the proportions of closeness lets you make distinctions. This form of why is the explanation usually given in papers. It is possible to methodically show that you will get the benefits described from the techniques proposed.<p>The greater Why is much, much harder: harder to identify and harder to prove. The first why can tell you that something works, but it can't really tell you why it works in a way that can inform other techniques.<p>In the Transformer, the intuition is that the Why is something along the lines of: Q transforms embeddings into an encoding of what information the embedding needs to resolve confusion, and K transforms embeddings into an encoding of the information they have to impart. When there's a match between 'what I want to know about' and 'what I know about', the V can be used as 'the things I know' to accumulate the information where it needs to be. (There's a minimal code sketch of this at the end of the comment.)<p>It's easy to see why this is the hard form. Once you get into higher semantic descriptions of what is happening, it is much harder to prove that this is actually what is happening, or that it gives the benefits you think it might. Maybe Transformers don't work like that. Sometimes semantic relationships appear to be present in a process when an unobserved quirk of the mathematics just happens to make the result coincidentally the same.<p>In a way I think of the maths of it as picking up a many-dimensional object in each hand and magically rotating and (linearly) squishing them differently until they look aligned enough to see the relationship I'm looking for, and poking those bits towards each other. I can't really think about that and the semantic "what things want to know about" at the same time, even though they are conceptualisations of the same operation.<p>The advantage of the lower why is that you can show that it works. The advantage of the upper why is that it can enable you to consider other mechanisms that might perform the same function. They may be mathematically different but achieve the same goal.<p>To take a much simpler example from computer graphics: there are many ways to draw a circle with simple loops processing mathematically provable descriptions of a circle. The Bresenham circle-drawing algorithm does so with a why that shows it makes a circle, but the "why do it that way" was informed by a greater understanding of the task being performed.
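<p>For concreteness, here's a rough sketch of that circle idea in its midpoint/Bresenham form, assuming we only want to collect integer pixel coordinates rather than actually plot them:

    def circle_points(radius):
        # Midpoint ("Bresenham") circle: walk one octant using an integer error
        # term instead of evaluating x*x + y*y == r*r, then mirror into all eight.
        points = []
        x, y = radius, 0
        err = 1 - radius                     # decision variable for the next midpoint
        while x >= y:
            for px, py in [(x, y), (y, x), (-y, x), (-x, y),
                           (-x, -y), (-y, -x), (y, -x), (x, -y)]:
                points.append((px, py))
            y += 1
            if err <= 0:
                err += 2 * y + 1             # midpoint still inside the circle: keep x
            else:
                x -= 1
                err += 2 * (y - x) + 1       # midpoint outside: step x inward
        return points

    # toy usage: the first octant of a radius-3 circle is (3,0), (3,1), (2,2)
    print(sorted(set(circle_points(3))))

The lower why is the error-term algebra; the greater Why is the realisation that you never needed squares or square roots to decide which pixel comes next.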
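<p>And, going back to the Q/K/V intuition above, a minimal single-head scaled dot-product attention sketch. The names X, Wq, Wk, Wv and the toy sizes are just illustrative, not taken from any particular model:

    import numpy as np

    def attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model) token embeddings.
        # Q asks "what do I want to know about?", K advertises "what do I know about?",
        # V carries "the things I know", mixed in wherever Q and K agree.
        Q = X @ Wq
        K = X @ Wk
        V = X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # how well each query matches each key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V                               # accumulate information where it's wanted

    # toy usage: 4 tokens, model width 8, head width 4
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
    out = attention(X, Wq, Wk, Wv)   # (4, 4): each output row is a weighted mixture of V rows

The lower why here is that the dot products and softmax provably produce a weighted average; the greater Why is the "wants to know / knows about" story, which the code neither confirms nor denies.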