Back around 2008, SVMs were all the rage in computer vision. We would use hand-designed visual features with a linear SVM on top; that was how object detectors were built (remember DPM?).

Funny how SVMs are just max-margin loss functions, yet we took it for granted that you needed domain expertise to craft features like HOG/SIFT by hand.

By 2018, we use ConvNets to learn BOTH the features and the classifier. In fact, in a modern CNN it's hard to say where the features end and the classifier begins.
If you want something closer to an ELI5 version, I recommend this tutorial [1].

Disclaimer: written by me.

[1] https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
I notice this doesn't mention hinge loss, which is by far the simpler way of arriving at the SVM. Hinge loss is just max(0, 1 - t*y), where y is the output of the linear model and t = ±1 is the label. It takes the common-sense approach of not penalizing points that are correctly classified and far enough from the decision boundary, and penalizing linearly after that.

In primal form, an SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression). For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivated by maximizing the margin. This is complicated and not very intuitive.

It also leads people to conflate the kernel trick with the dual form, when in fact they have nothing to do with each other: you can use the kernel trick in the primal SVM just fine.

Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes the primal much faster for large problems.
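For concreteness, here is a minimal sketch of the primal approach described above: a linear model trained by stochastic subgradient descent on the hinge loss with an L2 penalty. The toy data, learning rate, and regularization strength are illustrative choices of mine, not anything from the linked material:

    # Primal linear SVM: hinge loss + L2 penalty, fit by stochastic subgradient descent.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 2-D data: two Gaussian blobs with labels t in {-1, +1}.
    X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
    t = np.hstack([-np.ones(50), np.ones(50)])

    w, b = np.zeros(2), 0.0
    lam, lr = 0.01, 0.1   # regularization strength and learning rate (arbitrary here)

    for epoch in range(100):
        for i in rng.permutation(len(X)):
            y = X[i] @ w + b              # output of the linear model
            if t[i] * y < 1:              # inside the margin: hinge loss is active
                w -= lr * (lam * w - t[i] * X[i])
                b += lr * t[i]
            else:                         # outside the margin: only the L2 term contributes
                w -= lr * lam * w

    print("training accuracy:", np.mean(np.sign(X @ w + b) == t))

Swapping the dot product for a kernel expansion gives you a kernelized primal as mentioned above; nothing about the dual is needed for that.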
There's been some work on variational Bayesian formulations of SVMs in the last few years. These can give actual uncertainty estimates and do automatic hyperparameter tuning. This one in particular is very cool:

https://arxiv.org/pdf/1707.05532.pdf
It's interesting how quickly support vector machines went from the hot new thing for classifying images to an afterthought once deep learning started getting great results.
Bullet point on page 2: "Optimal hyperplane for linearly separable patterns"

I think the author may be working from a very different definition of the word "idiot".
I have a question!

The PDF says that the optimization problem in SVMs has a nice property: it is quadratic, which means there is a single global minimum to converge to, not lots of local minima as in neural networks. So it seems SVMs won't get stuck at a suboptimal solution.

Is that not a problem in DNNs now? Or is the dimensionality so high that local minima don't stop the optimizer, because there's always another way around them?
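For reference, the quadratic problem the PDF most likely has in mind is the standard soft-margin dual (a sketch in LaTeX; C is the usual box constraint, t_i the ±1 labels):

    \max_{\alpha}\; \sum_{i=1}^{n} \alpha_i
      - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
        \alpha_i \alpha_j\, t_i t_j\, x_i^{\top} x_j
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i t_i = 0

The objective is a concave quadratic in the multipliers, so any local optimum is global; the saddle points and local minima of a deep network's loss surface have no analogue here.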