I love the way Andrej Karpathy explains things. The code for the forward pass of a transformer block (attention plus MLP) looks like this:<p><pre><code>def forward(self, x):
    x = x + self.attn(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x
</code></pre>
This is how Andrej describes it (starting at 19:00 into the video):<p>"This is the pre-normalization version, where you see that x first goes through the layer normalization [ln_1] and then the attention (attn), and then goes back out to go to the layer normalization number two and the multilayer perceptron [MLP], sometimes also referred to as feed-forward network, FFN, and then that goes into the residual stream again."<p>"And the one more thing that's kind of interesting to note is: recall that attention is a communication operation, it is where all the tokens - and there's 1024 tokens lined up in a sequence - this is where the tokens communicate, where they exchange information... so, attention is an aggregation function, it's a pooling function, it's a weighted sum function, it is a <i>reduce</i> operation, whereas this MLP [multilayer perceptron] happens every single token individually - there's no information being collected or exchanged between the tokens. So the attention is the reduce, and the MLP is the <i>map</i>."<p>"And the transformer ends up just being repeated application of map-reduce, if you wanna think about it that way."
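If you want to poke at the map/reduce distinction yourself, here is a minimal, self-contained PyTorch sketch. It is not Karpathy's nanoGPT code: the toy sizes T and C are arbitrary, and nn.MultiheadAttention stands in for nanoGPT's causal self-attention (no causal mask applied) - both are assumptions purely for illustration.<p><pre><code>import torch
import torch.nn as nn

torch.manual_seed(0)
T, C = 8, 16                    # toy sequence length and embedding size (arbitrary)
x = torch.randn(1, T, C)

# "map": the MLP acts on each token independently
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
per_token = torch.stack([mlp(x[0, t]) for t in range(T)], dim=0)
assert torch.allclose(mlp(x)[0], per_token, atol=1e-6)  # same result token-by-token or in one call

# "reduce": attention mixes information across tokens
# (stand-in for nanoGPT's causal self-attention; no causal mask shown)
attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
y, _ = attn(x, x, x, need_weights=False)
x2 = x.clone()
x2[:, 0] += 1.0                                         # perturb only token 0
y2, _ = attn(x2, x2, x2, need_weights=False)
print((y - y2).abs().amax(dim=-1))                      # every position changes, not just position 0
</code></pre>
The assert shows the MLP gives the same answer whether you run the whole sequence at once or each token on its own (the map), while the print shows that nudging a single token changes the attention output at every position (the reduce, i.e. the communication step).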
H1: <i>build nanoGPT</i><p>Title: <i>Video+code lecture on building nanoGPT from scratch</i><p><i>Please don't do things to make titles stand out (..) Otherwise please use the original title</i> <a href="https://news.ycombinator.com/newsguidelines.html">https://news.ycombinator.com/newsguidelines.html</a>