This article describes positional encodings based on several sine waves with different frequencies, but I've also seen positional "embeddings" used, where the position (an integer) indexes into an embedding table to select a differentiable embedding vector. Thus, the model learns its own positional encoding. Does anyone know how these compare?<p>I've also wondered why we add the positional encoding to the token vector rather than concatenating the two?<p>Also, the terms encoding, embedding, projection, and others are all starting to sound the same to me. I'm not sure exactly what the difference is. Linear projections start to look like embeddings start to look like encodings start to look like projections, etc. I guess that's just the nature of linear algebra? It's all the same? The data is the computation, and the computation is the data. Numbers in, numbers out, and if the wrong numbers come out then God help you.<p>I digress. Is there a distinction between encoding, embedding, and projection I should be aware of?<p>I recently read in "The Little Learner" book that finding the right parameters <i>is</i> learning. That's the point. Everything we do in deep learning is focused on choosing the right sequence of numbers, and we call those numbers <i>parameters</i>. Every parameter has a specific role in our model. <i>Parameters</i> are our choice; those are the knobs that we (as a personified machine learning algorithm) get to adjust. Ever since then the word "parameters" has been much more meaningful to me. I'm hoping for similar clarity with these other words.
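To make the comparison concrete, here's a minimal NumPy sketch of both flavors. The sinusoidal formula is the fixed one from "Attention Is All You Need"; the learned variant is just a trainable table indexed by position (here initialized randomly, standing in for learned weights). The table sizes and dimensions are arbitrary choices for illustration, and the last two lines show the add-versus-concatenate distinction:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Fixed encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    #                 PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)

# A learned positional "embedding": one trainable row per position.
# Lookup by integer position is just row indexing, and gradients flow
# back into the selected rows during training.
learned_table = rng.normal(size=(512, 64))   # 512 positions, d_model=64

x = rng.normal(size=(10, 64))                # 10 token vectors

added = x + sinusoidal_encoding(10, 64)      # the usual approach: add
concatenated = np.concatenate([x, learned_table[:10]], axis=-1)  # alternative: concat

print(added.shape)          # (10, 64)  -- dimension unchanged
print(concatenated.shape)   # (10, 128) -- dimension doubled
```

Note that concatenation grows the model dimension (and hence every downstream weight matrix), which is one practical argument for adding instead.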