I am one of the authors. The most critical aspect is that the transformer is a "different kind of SVM": it solves an SVM that separates 'good' tokens within each input sequence from 'bad' tokens. This SVM acts as a good-token selector and is inherently different from the traditional SVM, which assigns a 0-1 label to inputs.<p>This also explains how attention induces sparsity through softmax: 'bad' tokens that fall on the wrong side of the SVM decision boundary are suppressed by the softmax function, while 'good' tokens are those that end up with nonzero softmax probabilities. It is also worth noting that this SVM arises from the exponential nature of the softmax.<p>The title of the paper does not make this clear, but hopefully the abstract does :).
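A minimal numpy sketch of that selection effect (my own toy illustration, not code from the paper; w stands in for the combined query-key direction):

    import numpy as np

    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(6, 4))   # 6 tokens in one sequence, dim 4
    w = rng.normal(size=4)             # hypothetical separating direction
    margins = tokens @ w               # signed "SVM" score per token

    def softmax(z):
        z = z - z.max()                # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # As the logits sharpen, softmax mass collapses onto the tokens on the
    # positive side of the boundary; negative-margin tokens go toward zero.
    for scale in (1.0, 5.0, 25.0):
        print(scale, np.round(softmax(scale * margins), 3))

As the scale grows, the probability mass on the negative-margin tokens vanishes, which is exactly the softmax-induced sparsity described above.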
Practically speaking, does this give us anything interesting from an implementation perspective? My uneducated reading of this is that a single SVM layer is equivalent to the multiple steps in a transformer layer. I'm guessing it can't reduce the number of computations purely from an information-theory argument, but doesn't it imply a radically simpler and easier-to-implement architecture?
Fully connected neural networks are hierarchies of logistic-regression nodes. Transformers are networks of SVM nodes. I guess we can expect networks of other kinds of classifiers in the future. Perhaps networks of decision-tree nodes? Mix and match?
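For what it's worth, the logistic-regression half of that analogy as a tiny numpy sketch (my own illustration): each sigmoid unit computes a logistic regression, and a fully connected net just stacks them.

    import numpy as np

    # One sigmoid unit computes sigmoid(w . x + b) -- exactly a logistic
    # regression; a fully connected network is layers of these stacked.
    def logistic_node(x, w, b):
        return 1.0 / (1.0 + np.exp(-(x @ w + b)))

    x = np.array([0.5, -1.0, 2.0])
    h = logistic_node(x, np.array([0.1, 0.2, -0.3]), 0.05)   # hidden node
    y = logistic_node(np.array([h]), np.array([1.5]), -0.7)  # node stacked on it
    print(h, y)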
SVMs are randomly initialized (with arbitrary priors) and deterministic thereafter.<p>From "What Is the Random Seed on SVM Sklearn, and Why Does It Produce Different Results?" <a href="https://saturncloud.io/blog/what-is-the-random-seed-on-svm-sklearn-and-why-does-it-produce-different-results/" rel="nofollow noreferrer">https://saturncloud.io/blog/what-is-the-random-seed-on-svm-s...</a> :<p>> <i>When you train an SVM model in sklearn, the algorithm uses a random initialization of the model parameters. This is necessary to avoid getting stuck in a local minimum during the optimization process.</i><p>> <i>The random initialization is controlled by a parameter called the random seed. The random seed is a number that is used to initialize the random number generator. This ensures that the random initialization of the model parameters is consistent across different runs of the code</i><p>From "Random Initialization For Neural Networks : A Thing Of The Past" (2018)
<a href="https://towardsdatascience.com/random-initialization-for-neural-networks-a-thing-of-the-past-bfcdd806bf9e" rel="nofollow noreferrer">https://towardsdatascience.com/random-initialization-for-neu...</a> :<p>> <i>Lets look at three ways to initialize the weights between the layers before we start the forward, backward propagation to find the optimum weights.</i><p>> <i>1: zero initialization</i><p>> <i>2: random initialization</i><p>> <i>3: he-et-al initialization</i><p>Deep learning:
<a href="https://en.wikipedia.org/wiki/Deep_learning" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Deep_learning</a><p>SVM: <a href="https://en.wikipedia.org/wiki/Support_vector_machine" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Support_vector_machine</a><p>Is it guaranteed that SVMs converge upon a solution regardless of random seed?
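Partially answering my own question: the standard soft-margin SVM objective is convex, so for a fixed dataset the optimum shouldn't depend on the seed. A quick sanity-check sketch (assuming sklearn's SVC with a linear kernel, where coef_ is exposed):

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # The soft-margin SVM objective is convex, so the fitted separator
    # should come out (numerically) identical across seeds; random_state
    # mainly affects probability calibration, not the optimum itself.
    X, y = make_blobs(n_samples=100, centers=2, random_state=42)

    for seed in (0, 1, 2):
        clf = SVC(kernel="linear", random_state=seed).fit(X, y)
        print(seed, clf.coef_.round(4), clf.intercept_.round(4))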
SVMs typically have a weight per data point, i.e. they are nonparametric: the model grows with the data rather than having a fixed parameter count. Modern machine learning doesn't really work like that anymore, right?
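Concretely, a minimal sklearn sketch of the "weight per data point" view:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    # In the dual formulation every training point gets a coefficient
    # alpha_i; only the support vectors keep a nonzero one, so the model
    # size tracks the data rather than a fixed parameter count.
    X, y = make_blobs(n_samples=200, centers=2, random_state=0)
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.dual_coef_.shape)   # (1, number_of_support_vectors)
    print(clf.n_support_)         # support vectors per class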
Fuck, imagine how many doctoral theses I could've written every time I tweaked a few lines of code to try some abstract way of recombining outputs I didn't fully understand. I missed the boat. All this jargon is absolutely for show, though. Purely intended to create the impression that there's some kind of moat to the "discovery". There are much clearer ways to express "we fucked around with putting the outputs of this black box back into the inputs", but I guess that doesn't impress the rubes.
[strike]Punk's[/strike] SVM's not dead!<p>(More seriously, it's good to find inroads to better formal understanding of what's happening in these systems.)
my tl;dr: this explains<p>1) why huge models are important (the gradient needs to be high-dimensional enough to be effectively monotonic);<p>2) why attention (a.k.a. connections, a.k.a. indirections) is trainable at all;<p>and it says nothing about why these models might generalize beyond the dataset