I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM": it solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM acts as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.<p>This also explains how attention induces sparsity through softmax: 'bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with nonzero softmax probabilities. It is also worth noting that this SVM arises from the exponential nature of the softmax.<p>The title of the paper does not make this clear, but hopefully the abstract does :).
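A minimal numpy sketch of that selection effect (my own toy illustration, not code from the paper; w stands in for the combined query-key direction):

    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(6, 4))   # 6 tokens in one sequence, dim 4
    w = rng.normal(size=4)             # hypothetical separating direction
    margins = tokens @ w               # signed "SVM" score per token

    def softmax(z):
        z = z - z.max()                # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # As the logits sharpen, softmax mass collapses onto the tokens on the
    # positive side of the boundary; negative-margin tokens go toward zero.
    for scale in (1.0, 5.0, 25.0):
        print(scale, np.round(softmax(scale * margins), 3))

As the scale grows, the probability mass on the negative-margin tokens vanishes, which is exactly the softmax-induced sparsity described above.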
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information-theory argument, but doesn't it imply a radically simpler and easier-to-implement architecture?
Fully connected neural networks are hierarchies of logistic-regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of decision-tree nodes? Mix and match?
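For what it's worth, the logistic-regression half of that analogy as a tiny numpy sketch (my own illustration): each sigmoid unit computes a logistic regression, and a fully connected net just stacks them.

    import numpy as np

    # One sigmoid unit computes sigmoid(w . x + b) -- exactly a logistic
    # regression; a fully connected network is layers of these stacked.
    def logistic_node(x, w, b):
        return 1.0 / (1.0 + np.exp(-(x @ w + b)))

    x = np.array([0.5, -1.0, 2.0])
    h = logistic_node(x, np.array([0.1, 0.2, -0.3]), 0.05)   # hidden node
    y = logistic_node(np.array([h]), np.array([1.5]), -0.7)  # node stacked on it
    print(h, y)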
SVMs are randomly initialized (with arbitrary priors) and deterministic thereafter.<p>From "What Is the Random Seed on SVM Sklearn, and Why Does It Produce Different Results?" <a href="https://saturncloud.io/blog/what-is-the-random-seed-on-svm-sklearn-and-why-does-it-produce-different-results/" rel="nofollow noreferrer">https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s...</a> :<p>> <i>When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.</i><p>> <i>The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code</i><p>From "Random Initialization For Neural Networks : A Thing Of The Past" (2018)
<a href="https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e" rel="nofollow noreferrer">https://towardsdatascience.com/random-initialization-for-neu...</a> :<p>> <i>Lets look at three ways to initialize the weights between the layers before we start the forward, backward propagation to find the optimum weights.</i><p>> <i>1: zero initialization</i><p>> <i>2: random initialization</i><p>> <i>3: he-et-al initialization</i><p>Deep learning:
<a href="https://en.wikipedia.org/wiki/Deep_learning" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Deep_learning</a><p>SVM: <a href="https://en.wikipedia.org/wiki/Support_vector_machine" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Support_vector_machine</a><p>Is it guaranteed that SVMs converge upon a solution regardless of random seed?
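Partially answering my own question: the standard soft-margin SVM objective is convex, so for a fixed dataset the optimum shouldn't depend on the seed. A quick sanity-check sketch (assuming sklearn's SVC with a linear kernel, where coef_ is exposed):

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # The soft-margin SVM objective is convex, so the fitted separator
    # should come out (numerically) identical across seeds; random_state
    # mainly affects probability calibration, not the optimum itself.
    X, y = make_blobs(n_samples=100, centers=2, random_state=42)

    for seed in (0, 1, 2):
        clf = SVC(kernel="linear", random_state=seed).fit(X, y)
        print(seed, clf.coef_.round(4), clf.intercept_.round(4))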
SVMs typically have a weight per data point, i.e. they are nonparametric: the model grows with the data rather than having a fixed parameter count. Modern machine learning doesn't really work like that anymore, right?
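Concretely, a minimal sklearn sketch of the "weight per data point" view:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # In the dual formulation every training point gets a coefficient
    # alpha_i; only the support vectors keep a nonzero one, so the model
    # size tracks the data rather than a fixed parameter count.
    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.dual_coef_.shape)   # (1, number_of_support_vectors)
    print(clf.n_support_)         # support vectors per class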
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
[strike]Punk's[/strike] SVM's not dead!<p>(More seriously, it's good to find inroads to better formal understanding of what's happening in these systems.)
my tl;dr: this explains<p>1) why huge models are important (the gradient needs to be high-dimensional enough to be effectively monotonic);<p>2) why attention (a.k.a. connections, a.k.a. indirections) is trainable at all;<p>and it says nothing about why these models might generalize beyond the dataset