I notice this doesn't mention hinge loss, which is by far the simpler way of arriving at the SVM. Hinge loss is just max(0, 1 - t*y), where y is the output of the linear model and t = ±1 is the label. It takes the common-sense approach of not penalizing points that are classified correctly with enough margin, and penalizing linearly after that.

In primal form, an SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression). For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivated by maximizing the margin. That derivation is complicated and not very intuitive.

This also causes people to conflate the kernel trick and the dual form, when in fact they have nothing to do with each other. You can use the kernel trick in the primal SVM just fine.

Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes the primal much faster than the dual for large problems.
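To make the primal view concrete, here is a minimal sketch of a linear SVM trained by SGD on the hinge loss plus an L2 penalty (which is what the soft-margin objective amounts to). It's numpy-only; the function names and hyperparameters are my own for illustration, not any particular library's API.

    import numpy as np

    def hinge_loss(y, t):
        # max(0, 1 - t*y): zero for points on the correct side with margin >= 1,
        # linear penalty otherwise
        return np.maximum(0.0, 1.0 - t * y)

    def train_primal_svm(X, t, lam=1e-3, lr=0.01, epochs=20, seed=0):
        # SGD on the primal objective: mean hinge loss + (lam/2) * ||w||^2
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(epochs):
            for i in rng.permutation(n):
                y = X[i] @ w + b
                if t[i] * y < 1.0:
                    # hinge loss is active: subgradient is -t*x (and -t for the bias)
                    w -= lr * (lam * w - t[i] * X[i])
                    b -= lr * (-t[i])
                else:
                    # only the regularizer contributes
                    w -= lr * (lam * w)
        return w, b

    # toy usage: two Gaussian blobs with labels in {-1, +1}
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    t = np.concatenate([-np.ones(50), np.ones(50)])
    w, b = train_primal_svm(X, t)
    print("training accuracy:", (np.sign(X @ w + b) == t).mean())
    print("mean hinge loss:", hinge_loss(X @ w + b, t).mean())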
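And to back up the point that the kernel trick has nothing to do with the dual: the same primal objective can be kernelized by representing w as a weighted sum of the training points' feature maps, so only kernel values are ever needed. Again a purely illustrative numpy sketch (reusing X and t from above; rbf_kernel and train_kernel_primal_svm are made-up names), doing plain subgradient descent on the primal objective in terms of alpha.

    def rbf_kernel(A, B, gamma=0.5):
        # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def train_kernel_primal_svm(X, t, gamma=0.5, lam=1e-3, lr=0.01, epochs=200):
        # Represent w implicitly as sum_i alpha_i * phi(x_i); then
        # f(x) = sum_i alpha_i * K(x_i, x) and ||w||^2 = alpha^T K alpha.
        n = X.shape[0]
        K = rbf_kernel(X, X, gamma)
        alpha = np.zeros(n)
        for _ in range(epochs):
            y = K @ alpha                   # f(x_i) for every training point
            active = (t * y < 1.0)          # where the hinge loss is nonzero
            # subgradient of mean hinge loss wrt alpha, plus gradient of the L2 term
            grad = K @ (-(t * active) / n) + lam * (K @ alpha)
            alpha -= lr * grad
        return alpha

    alpha = train_kernel_primal_svm(X, t)
    pred = np.sign(rbf_kernel(X, X) @ alpha)
    print("training accuracy (kernelized primal):", (pred == t).mean())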