This is a _great_ article. One of the things I enjoy most is finding new ways to understand or think about things I already feel like I know, and this article helped me do both with transformer networks. I especially liked how explicitly and simply it explained things like queries, keys, and values; permutation equivariance; and even the distinction between learned model parameters and parameters derived from the data (like the attention weights).

The author quotes Feynman, and I think this article is a great example of his concept of explaining complex subjects in simple terms.
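That last distinction clicked for me in code form: the projection matrices are learned and fixed after training, while the attention weights are recomputed from each input. A minimal numpy sketch (names, shapes, and random values are mine, not from the article):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 4

# Learned model parameters: fixed once training is done.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

# An input sequence of 3 token embeddings.
x = rng.normal(size=(3, d))

# Queries, keys, and values are projections of the input.
q, k, v = x @ W_q, x @ W_k, x @ W_v

# The attention weights are derived from the data at run time --
# they are not stored parameters of the model.
attn = softmax(q @ k.T / np.sqrt(d))  # shape (3, 3), rows sum to 1
out = attn @ v
```

Feed in a different `x` and `attn` changes, while `W_q`, `W_k`, `W_v` stay the same; that's the whole distinction in two lines.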