This article describes positional encodings based on several sine waves with different frequencies, but I've also seen positional "embeddings" used, where the position (an integer) is used to select a differentiable embedding from an embedding table, so the model learns its own positional encoding. Does anyone know how these compare?

I've also wondered why we add the positional encoding to the value, rather than concatenating them?

Also, the terms encoding, embedding, projection, and others are all starting to sound the same to me. I'm not sure exactly what the difference is. Linear projections start to look like embeddings start to look like encodings start to look like projections, etc. I guess that's just the nature of linear algebra? It's all the same? The data is the computation, and the computation is the data. Numbers in, numbers out, and if the wrong numbers come out then God help you.

I digress. Is there a distinction between encoding, embedding, and projection I should be aware of?

I recently read in "The Little Learner" book that finding the right parameters *is* learning. That's the point. Everything we do in deep learning is focused on choosing the right sequence of numbers, and we call those numbers *parameters*. Every parameter has a specific role in our model. *Parameters* are our choice; those are the knobs that we (as a personified machine learning algorithm) get to adjust. Ever since then the word "parameters" has been much more meaningful to me. I'm hoping for similar clarity with these other words.
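For concreteness, here is roughly what I mean by the two variants. This is just a minimal sketch, assuming PyTorch, with illustrative names like `learned_pos`; in both cases the encoding is usually added to the token embeddings rather than concatenated:

    import torch
    import torch.nn as nn

    d_model, max_len = 512, 1024

    # Learned positional "embeddings": one trainable vector per position,
    # looked up by the integer position index and trained like any other weight.
    learned_pos = nn.Embedding(max_len, d_model)

    # Fixed sinusoidal encodings: each dimension pair is a sine/cosine of the
    # position at a different frequency; nothing here is learned.
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                      * (-torch.log(torch.tensor(10000.0)) / d_model))
    sinusoidal_pos = torch.zeros(max_len, d_model)
    sinusoidal_pos[:, 0::2] = torch.sin(pos * freqs)
    sinusoidal_pos[:, 1::2] = torch.cos(pos * freqs)

    # Both are typically *added* to the token embeddings (same d_model),
    # not concatenated as extra channels.
    tok = torch.randn(1, 16, d_model)            # (batch, seq_len, d_model)
    positions = torch.arange(16)
    x_learned  = tok + learned_pos(positions)    # broadcasts over the batch dim
    x_sinusoid = tok + sinusoidal_pos[:16]

If I remember correctly, the original "Attention Is All You Need" paper tried both and reported nearly identical results, which is partly why I'm curious whether anyone has seen a practical difference.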
An early explainer of transformers, which is a quicker read, that I found very useful when they were still new to me, is The Illustrated Transformer [1], by Jay Alammar.

A more recent, academic but high-level explanation of transformers, very good for detail on the different flavors of data flow (e.g. encoder-decoder vs decoder-only), is Formal Algorithms for Transformers [2], from DeepMind.

[1] https://jalammar.github.io/illustrated-transformer/
[2] https://arxiv.org/abs/2207.09238
Thank you for sharing!

For the "from scratch" version, I recommend "The GPT-3 Architecture, on a Napkin" (https://dugas.ch/artificial_curiosity/GPT_architecture.html), which was discussed here as well (https://news.ycombinator.com/item?id=33942597).

Then, to actually dive into details, "The Annotated Transformer", i.e. a walkthrough of "Attention Is All You Need" with code in PyTorch: https://nlp.seas.harvard.edu/2018/04/03/attention.html
Besides everything that was mentioned here, what made it finally click for me early in my journey was running through this excellent tutorial by Peter Bloem multiple times: https://peterbloem.nl/blog/transformers. Highly recommend.
Andrej Karpathy's 2-hour video and code is really good for understanding the details of transformers:

"Let's build GPT: from scratch, in code, spelled out."

https://youtube.com/watch?v=kCc8FmEb1nY
Related:

*Transformers from Scratch* - https://news.ycombinator.com/item?id=29315107 - Nov 2021 (17 comments)

also these, but it was a different article:

*Transformers from Scratch (2019)* - https://news.ycombinator.com/item?id=29280909 - Nov 2021 (9 comments)

*Transformers from Scratch* - https://news.ycombinator.com/item?id=20773992 - Aug 2019 (28 comments)
So I’m on the same journey of trying to teach myself ML, and I do find most of the resources go over things very quickly and leave a lot for you to figure out yourself.

Having had a quick look at this one, it looks very beginner-friendly, and also very careful to explain things slowly, so I will definitely add it to my reading list.

Thanks to the author for this!
If you want a TensorFlow implementation, here it is: https://machinelearningmastery.com/building-transformer-models-with-attention-crash-course-build-a-neural-machine-translator-in-12-days/
Can somebody explain to me the sine wave positional encoding thing? The naïve approach would be to just add number indices to the tokens, wouldn’t it?
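For reference, the encoding I'm asking about (from "Attention Is All You Need") gives position pos and embedding dimensions 2i / 2i+1 the values:

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

My rough understanding is that each dimension oscillates at a different frequency, so every entry stays in [-1, 1] no matter how long the sequence gets, whereas raw integer indices would grow without bound and swamp the token embeddings at large positions. I'd love a better intuition, though.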
Did anyone make the obvious "Robots in Smalltalk" joke yet?

Okay... here goes...

When I first read that title I thought the author was talking about Robots in Smalltalk.
This is cool. I highly recommend Jay Alammar’s Illustrated Transformer series to anyone wanting to get an understanding of the different types of transformers and how self-attention works.

The math behind self-attention is also cool and easy to extend to e.g. dual attention.
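For anyone who hasn't seen it written out, the core of single-head scaled dot-product self-attention is only a few lines. A minimal sketch, assuming PyTorch; the function and variable names are just illustrative:

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / k.shape[-1] ** 0.5   # (seq_len, seq_len) pairwise similarities
        weights = F.softmax(scores, dim=-1)     # each row sums to 1
        return weights @ v                      # weighted sum of value vectors

    # toy usage
    d_model, d_k, seq_len = 8, 8, 4
    x = torch.randn(seq_len, d_model)
    w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
    out = self_attention(x, w_q, w_k, w_v)      # (seq_len, d_k)

Multi-head attention then just runs several of these in parallel with different projection matrices and concatenates the results.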
Read this as "Transformers in Scratch" at first and was <i>very</i> curious.<p>Obviously implementing transformers in Scratch is likely impossible, but has anyone built a Scratch-like environment for building NN models?