This is the commit that changed it:
<a href="https://github.com/tensorflow/tensor2tensor/commit/d5bdfcc85fa3e10a73902974f2c0944dc51f6a33">https://github.com/tensorflow/tensor2tensor/commit/d5bdfcc85...</a>
This note contains four papers for "historical perspective"... which would usually mean "no longer directly relevant", although I'm not sure that's really what the author means.<p>You might be looking for the author's "Understanding Large Language Models" post [1] instead.<p>Misspelling "Attention is All Your Need" twice in one paragraph makes for a rough start to the linked post.<p>[1] <a href="https://magazine.sebastianraschka.com/p/understanding-large-language-models" rel="nofollow">https://magazine.sebastianraschka.com/p/understanding-large-...</a>
I wonder if a function is an example of a transformer. Take the phrase "argument one is cat, argument two is dog, and the operation is join, so the result is the word catdog": a transformer could handle this as the function concat(cat, dog). Here the query is the function, the keys are the arguments to the function, and the value is a mapping from words to words.
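For anyone wanting to pin down what the query/key/value roles actually do, here is a minimal NumPy sketch of the paper's scaled dot-product attention (the toy Q/K/V values are illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # query-key similarity
    weights = softmax(scores, axis=-1)  # rows sum to 1
    return weights @ V                # weighted blend of values

# Toy example: one query attends over two key/value pairs.
Q = np.array([[1.0, 0.0]])           # the "question" being asked
K = np.array([[1.0, 0.0],            # key for value 0 (matches Q)
              [0.0, 1.0]])           # key for value 1
V = np.array([[10.0],
              [20.0]])               # values to mix
out = attention(Q, K, V)             # blend weighted toward 10.0
```

The output is a soft mixture of the values, pulled toward whichever value's key best matches the query, which is roughly the "function applied to arguments" intuition in the comment.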
The actual title "Why the Original Transformer Figure Is Wrong, and Some Other Interesting Historical Tidbits About LLMs" is way more representative of what this post is about...<p>As to the figure being wrong, it's kind of a nit-pick:
"While the original transformer figure above (from Attention Is All Your Need, <a href="https://arxiv.org/abs/1706.03762" rel="nofollow">https://arxiv.org/abs/1706.03762</a>) is a helpful summary of the original encoder-decoder architecture, there is a slight discrepancy in this figure.<p>For instance, it places the layer normalization between the residual blocks, which doesn't match the official (updated) code implementation accompanying the original transformer paper. The variant shown in the Attention Is All Your Need figure is known as Post-LN Transformer."
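To make the discrepancy concrete, here is a toy sketch (illustrative function names, LayerNorm without learned scale/shift for brevity) contrasting the two sublayer orderings:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension; learned gain/bias omitted
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (as drawn in the figure): residual add first, then normalize
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN (as in the updated code): normalize the sublayer's input,
    # and leave the residual path untouched
    return x + sublayer(layer_norm(x))

# The two orderings give different outputs for the same sublayer.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
sub = lambda t: 0.5 * t  # stand-in for attention or the feed-forward net
a = post_ln_block(x, sub)
b = pre_ln_block(x, sub)
```

Note the Pre-LN variant keeps an identity path from input to output (`x + ...`), which is the usual argument for its more stable training.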