Transformers as Support Vector Machines

251 points by fofoz over 1 year ago

13 comments

sametoymak over 1 year ago
I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM": it solves an SVM that separates 'good' tokens from 'bad' tokens within each input sequence. This SVM serves as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.

This also explains how attention induces sparsity through softmax: 'bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with non-zero softmax probabilities. It is also worth mentioning that this SVM arises from the exponential nature of the softmax.

The title of the paper does not make this clear, but hopefully the abstract does :).
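A minimal numpy sketch of that softmax-as-token-selector effect (my own illustration, not code from the paper; the scores are made up):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical attention scores for 5 tokens in one sequence: two "good"
# tokens with large scores, three "bad" tokens on the wrong side.
scores = np.array([4.0, 3.5, -2.0, -3.0, -5.0])

print(softmax(scores).round(3))
# -> [0.621 0.377 0.002 0.001 0.   ]  almost all probability mass goes to
#    the good tokens; the bad tokens are effectively suppressed.
```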
regularfry over 1 year ago
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information-theory argument, but doesn't it imply a radically simpler and easier-to-implement architecture?
abhinai over 1 year ago
Fully connected neural networks are hierarchies of logistic regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of Decision Tree nodes? Mix and match?
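As a rough, purely illustrative sketch of that analogy (the node definitions here are my own, not from the paper): a "logistic regression node" squashes a linear score through a sigmoid, while an "SVM-style node" would threshold the same score.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                     # hypothetical input vector
w, b = rng.standard_normal(8), 0.1             # shared linear weights

score = w @ x + b
logistic_node = 1.0 / (1.0 + np.exp(-score))   # sigmoid unit: soft output in (0, 1)
svm_node = np.sign(score)                      # hard margin-style unit: -1 / +1

print(score, logistic_node, svm_node)
```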
westurner over 1 year ago
SVMs are randomly initialized (with arbitrary priors) and then are deterministic.

From "What Is the Random Seed on SVM Sklearn, and Why Does It Produce Different Results?" https://saturncloud.io/blog/what-is-the-random-seed-on-svm-sklearn-and-why-does-it-produce-different-results/ :

> When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.

> The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code.

From "Random Initialization For Neural Networks: A Thing Of The Past" (2018) https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e :

> Lets look at three ways to initialize the weights between the layers before we start the forward, backward propagation to find the optimum weights.

> 1: zero initialization
> 2: random initialization
> 3: he-et-al initialization

Deep learning: https://en.wikipedia.org/wiki/Deep_learning

SVM: https://en.wikipedia.org/wiki/Support_vector_machine

Is it guaranteed that SVMs converge upon a solution regardless of random seed?
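For what it's worth, a small scikit-learn sketch of the seeding question above (my example, not from the linked posts; in sklearn's SVC the random_state only seeds the shuffling used for probability calibration, and the SVM objective itself is convex, so the fitted decision boundary does not depend on the seed):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Same seed -> identical probability estimates; the support vectors and
# decision function are determined by the convex SVM problem itself,
# not by the seed.
clf_a = SVC(probability=True, random_state=42).fit(X, y)
clf_b = SVC(probability=True, random_state=42).fit(X, y)

print((clf_a.predict_proba(X) == clf_b.predict_proba(X)).all())  # True
```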
gugagore over 1 year ago
SVMs typically have weights per data point, i.e. they are nonparametric/hyperparametric. Modern machine learning doesn't really work like that anymore, right?
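A quick illustration of that point using scikit-learn (my example; the attribute names are from sklearn's SVC API): a fitted kernel SVM keeps one dual coefficient per support vector, i.e. per retained training point, rather than a fixed-size parameter vector.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)

# One dual weight per support vector (a subset of the training points).
print(X.shape[0], clf.n_support_, clf.dual_coef_.shape)
```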
quickthrower2 over 1 year ago
I would love an Andrej video on this
noduerme over 1 year ago
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
sdenton4 over 1 year ago
[strike]Punk's[/strike] SVM's not dead!

(More seriously, it's good to find inroads to better formal understanding of what's happening in these systems.)
sgt101 over 1 year ago
Universal function approximator == universal function approximator
SomeoneFromCA over 1 year ago
Transformers as voltage amplifiers.
wizzard0 over 1 year ago
my tldr: this explains

1) why huge models are important (so the gradient is high-dimensional enough to be monotonic)

2) why attention (aka connections, aka indirections) is trainable at all;

and says nothing about why they might generalize the dataset
adamnemecek over 1 year ago
All machine learning is about finding hyperplanes.
hexo over 1 year ago
Is this an April joke?