Transformers from Scratch (2021)

644 points by jasim, about 2 years ago

16 comments

Buttons840, about 2 years ago
This article describes positional encodings based on several sine waves with different frequencies, but I've also seen positional "embeddings" used, where the position (an integer value) is used to select a differentiable embedding from an embedding table, so the model learns its own positional encoding. Does anyone know how these compare?

I've also wondered why we add the positional encoding to the value, rather than concatenating them?

Also, the terms encoding, embedding, projection, and others are all starting to sound the same to me. I'm not sure exactly what the difference is. Linear projections start to look like embeddings start to look like encodings start to look like projections, etc. I guess that's just the nature of linear algebra? It's all the same? The data is the computation, and the computation is the data. Numbers in, numbers out, and if the wrong numbers come out then God help you.

I digress. Is there a distinction between encoding, embedding, and projection I should be aware of?

I recently read in "The Little Learner" book that finding the right parameters *is* learning. That's the point. Everything we do in deep learning is focused on choosing the right sequence of numbers, and we call those numbers *parameters*. Every parameter has a specific role in our model. *Parameters* are our choice; those are the knobs that we (as a personified machine learning algorithm) get to adjust. Ever since then the word "parameters" has been much more meaningful to me. I'm hoping for similar clarity with these other words.
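For anyone who wants to see the two options side by side, here is a minimal sketch in PyTorch (not from the article; `d_model`, `max_len`, and all shapes are illustrative) contrasting fixed sinusoidal encodings with a learned position-embedding table, and showing the add-versus-concatenate choice:

```python
import math
import torch
import torch.nn as nn

d_model, max_len = 64, 512

# Option 1: fixed sinusoidal encodings (as in "Attention Is All You Need").
pos = torch.arange(max_len).unsqueeze(1)                      # (max_len, 1)
div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
sinusoidal = torch.zeros(max_len, d_model)
sinusoidal[:, 0::2] = torch.sin(pos * div)
sinusoidal[:, 1::2] = torch.cos(pos * div)

# Option 2: a learned embedding table indexed by position; its rows are
# ordinary parameters trained along with the rest of the model.
learned = nn.Embedding(max_len, d_model)

tok_emb = torch.randn(1, 10, d_model)       # (batch, seq, d_model) token embeddings
positions = torch.arange(10)

x_sin = tok_emb + sinusoidal[:10]           # add fixed encodings (the usual choice)
x_learned = tok_emb + learned(positions)    # add learned embeddings; same shape
# Concatenation is also possible, but doubles the width the next layer sees:
x_cat = torch.cat([tok_emb, sinusoidal[:10].expand(1, 10, d_model)], dim=-1)
```

Adding keeps the model width fixed, and in practice the network seems able to keep positional and content information in separate subspaces; concatenating makes that split explicit at the cost of a wider first layer.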
dsubburam, about 2 years ago
An early explainer of transformers, which is a quicker read, that I found very useful when they were still new to me, is The Illustrated Transformer [1], by Jay Alammar.

A more recent academic but high-level explanation of transformers, very good for detail on the different flow flavors (e.g. encoder-decoder vs decoder-only), is Formal Algorithms for Transformers [2], from DeepMind.

[1] https://jalammar.github.io/illustrated-transformer/
[2] https://arxiv.org/abs/2207.09238
stared, about 2 years ago
Thank you for sharing!

For the "from scratch" version, I recommend "The GPT-3 Architecture, on a Napkin" https://dugas.ch/artificial_curiosity/GPT_architecture.html, which was posted here as well (https://news.ycombinator.com/item?id=33942597).

Then, to actually dive into the details, "The Annotated Transformer", i.e. a walkthrough of "Attention Is All You Need" with code in PyTorch: https://nlp.seas.harvard.edu/2018/04/03/attention.html
lucidrains, about 2 years ago
Besides everything that was mentioned here, what made it finally click for me early in my journey was running through this excellent tutorial by Peter Bloem multiple times: https://peterbloem.nl/blog/transformers. Highly recommend.
erwincoumans, about 2 years ago
Andrej Karpathy's 2-hour video and code are really good for understanding the details of Transformers:

"Let's build GPT: from scratch, in code, spelled out."

https://youtube.com/watch?v=kCc8FmEb1nY
dang, about 2 years ago
Related:

*Transformers from Scratch* - https://news.ycombinator.com/item?id=29315107 - Nov 2021 (17 comments)

Also these, but it was a different article:

*Transformers from Scratch (2019)* - https://news.ycombinator.com/item?id=29280909 - Nov 2021 (9 comments)

*Transformers from Scratch* - https://news.ycombinator.com/item?id=20773992 - Aug 2019 (28 comments)
quickthrower2, about 2 years ago
I'm on the same journey of trying to teach myself ML, and I find most of the resources go over things very quickly and leave a lot for you to figure out yourself.

Having had a quick look at this one, it looks very beginner-friendly, and also very careful to explain things slowly, so I will definitely add it to my reading list.

Thanks to the author for this!
adriantam, about 2 years ago
If you want a TensorFlow implementation, here it is: https://machinelearningmastery.com/building-transformer-models-with-attention-crash-course-build-a-neural-machine-translator-in-12-days/
leobg, about 2 years ago
Can somebody explain to me the sine-wave positional encoding thing? The naïve approach would be to just add number indices to the tokens, wouldn't it?
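A short note on what the sinusoidal scheme actually computes (this is the formulation from "Attention Is All You Need", not from the article above): each position pos gets a vector whose dimensions are sines and cosines at geometrically spaced frequencies,

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Unlike a raw integer index, these values stay bounded in [-1, 1] no matter how long the sequence is, nearby positions get similar vectors, and for any fixed offset k, PE(pos+k) is a linear function of PE(pos), which makes it easy for the model to attend by relative position.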
dingosity, about 2 years ago
Did anyone make the obvious "Robots in Smalltalk" joke yet?

Okay... here goes...

When I first read that title I thought the author was talking about Robots in Smalltalk.
toyg, about 2 years ago
MORE THAN MEETS THE EYE!

... oh, not *those* Transformers. Meh.
cuuupid, about 2 years ago
This is cool. I highly recommend Jay Alammar's Illustrated Transformer series to anyone wanting to get an understanding of the different types of transformers and how self-attention works.

The math behind self-attention is also cool and easy to extend to e.g. dual attention.
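For reference, the core of that math is small enough to write out. A minimal NumPy sketch of single-head scaled dot-product self-attention (shapes, names, and the random inputs are illustrative, not taken from any of the linked articles):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5
X = rng.normal(size=(seq_len, d_model))
out = self_attention(X, *(rng.normal(size=(d_model, d_k)) for _ in range(3)))
print(out.shape)  # (5, 8)
```

Multi-head attention just runs several of these in parallel with different projection matrices and concatenates the results.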
JackFr, about 2 years ago
Read this as "Transformers in Scratch" at first and was *very* curious.

Obviously implementing transformers in Scratch is likely impossible, but has anyone built a Scratch-like environment for building NN models?
pmoriarty, about 2 years ago
So how practical is learning to create your own transformers if you can't afford a giant amount of resources to train them?
metalloid, about 2 years ago
The author of the article should have provided an implementation of the transformer using only NumPy or pure C++.
sachinkalsi, about 2 years ago
Check this out: https://youtu.be/73gTEub2e3I