This article describes positional encodings based on several sine waves with different frequencies, but I've also seen positional "embeddings" used, where the position (an integer) indexes into an embedding table to select a differentiable embedding vector. Thus, the model learns its own positional encoding. Does anyone know how these compare?<p>I've also wondered why we add the positional encoding to the token vector rather than concatenating the two?<p>Also, the terms encoding, embedding, projection, and others are all starting to sound the same to me. I'm not sure exactly what the difference is. Linear projections start to look like embeddings start to look like encodings start to look like projections, etc. I guess that's just the nature of linear algebra? It's all the same? The data is the computation, and the computation is the data. Numbers in, numbers out, and if the wrong numbers come out then God help you.<p>I digress. Is there a distinction between encoding, embedding, and projection I should be aware of?<p>I recently read in "The Little Learner" book that finding the right parameters <i>is</i> learning. That's the point. Everything we do in deep learning is focused on choosing the right sequence of numbers, and we call those numbers <i>parameters</i>. Every parameter has a specific role in our model. <i>Parameters</i> are our choice; those are the knobs that we (as a personified machine learning algorithm) get to adjust. Ever since then the word "parameters" has been much more meaningful to me. I'm hoping for similar clarity with these other words.
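To make the comparison concrete, here's a minimal NumPy sketch of both flavors. The sinusoidal formula is the fixed one from "Attention Is All You Need"; the learned variant is just a trainable table indexed by position (here initialized randomly, standing in for learned weights). The table sizes and dimensions are arbitrary choices for illustration, and the last two lines show the add-versus-concatenate distinction:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Fixed encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    #                 PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)

# A learned positional "embedding": one trainable row per position.
# Lookup by integer position is just row indexing, and gradients flow
# back into the selected rows during training.
learned_table = rng.normal(size=(512, 64))   # 512 positions, d_model=64

x = rng.normal(size=(10, 64))                # 10 token vectors

added = x + sinusoidal_encoding(10, 64)      # the usual approach: add
concatenated = np.concatenate([x, learned_table[:10]], axis=-1)  # alternative: concat

print(added.shape)          # (10, 64)  -- dimension unchanged
print(concatenated.shape)   # (10, 128) -- dimension doubled
```

Note that concatenation grows the model dimension (and hence every downstream weight matrix), which is one practical argument for adding instead.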