I love the way Andrej Karpathy explains things. The code for the forward pass of a transformer block (attention plus MLP) looks like this:<p><pre><code>def forward(self, x):
    x = x + self.attn(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x
</code></pre>
This is how Andrej describes it (starting at 19:00 into the video):<p>"This is the pre-normalization version, where you see that x first goes through the layer normalization [ln_1] and then the attention (attn), and then goes back out to go to the layer normalization number two and the multilayer perceptron [MLP], sometimes also referred to as feed-forward network, FFN, and then that goes into the residual stream again."<p>"And the one more thing that's kind of interesting to note is: recall that attention is a communication operation, it is where all the tokens - and there's 1024 tokens lined up in a sequence - this is where the tokens communicate, where they exchange information... so, attention is an aggregation function, it's a pooling function, it's a weighted sum function, it is a <i>reduce</i> operation, whereas this MLP [multilayer perceptron] happens every single token individually - there's no information being collected or exchanged between the tokens. So the attention is the reduce, and the MLP is the <i>map</i>."<p>"And the transformer ends up just being repeated application of map-reduce, if you wanna think about it that way."
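If you want to poke at the map/reduce distinction yourself, here is a minimal, self-contained PyTorch sketch. It is not Karpathy's nanoGPT code: the toy sizes T and C are arbitrary, and nn.MultiheadAttention stands in for nanoGPT's causal self-attention (no causal mask applied) - both are assumptions purely for illustration.<p><pre><code>import torch
import torch.nn as nn

torch.manual_seed(0)
T, C = 8, 16                    # toy sequence length and embedding size (arbitrary)
x = torch.randn(1, T, C)

# "map": the MLP acts on each token independently
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
per_token = torch.stack([mlp(x[0, t]) for t in range(T)], dim=0)
assert torch.allclose(mlp(x)[0], per_token, atol=1e-6)  # same result token-by-token or in one call

# "reduce": attention mixes information across tokens
# (stand-in for nanoGPT's causal self-attention; no causal mask shown)
attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)
y, _ = attn(x, x, x, need_weights=False)
x2 = x.clone()
x2[:, 0] += 1.0                                         # perturb only token 0
y2, _ = attn(x2, x2, x2, need_weights=False)
print((y - y2).abs().amax(dim=-1))                      # every position changes, not just position 0
</code></pre>
The assert shows the MLP gives the same answer whether you run the whole sequence at once or each token on its own (the map), while the print shows that nudging a single token changes the attention output at every position (the reduce, i.e. the communication step).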
H1: <i>build nanoGPT</i><p>Title: <i>Video+code lecture on building nanoGPT from scratch</i><p><i>Please don't do things to make titles stand out (..) Otherwise please use the original title</i> <a href="https://news.ycombinator.com/newsguidelines.html">https://news.ycombinator.com/newsguidelines.html</a>