It isn't mentioned in the abstract, but this seems to be more of an overview of ML-specific notions of gradient descent, where batching is possible because you're taking gradients of a fixed prediction architecture, with respect to its tunable weights, summed over a large set of training data.

So each of those training points represents a separable, parallelizable piece of the whole process, giving you a lot of freedom in how you actually execute the gradient stepping (with one training point, several of them, or all of them). As I understand it, the stochasticity this introduces interestingly adds enough "noise" that local minima are avoided in many cases.

In more general applications of non-linear gradient-based optimization (say, optimizing parametric models in physical engineering), this structure doesn't necessarily come into play.
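To make the "one, several, or all training points" part concrete, here's a minimal sketch in plain NumPy (hypothetical linear-regression setup, invented for the example, not from the article) showing that the same update rule just gets fed a different slice of the data:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: 1000 points, 5 features, linear ground truth plus noise.
    X = rng.normal(size=(1000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    def grad(w, X_batch, y_batch):
        # Gradient of mean squared error w.r.t. the weights, over any subset of the data.
        residual = X_batch @ w - y_batch
        return 2.0 * X_batch.T @ residual / len(y_batch)

    w = np.zeros(5)
    lr = 0.1

    for step in range(2000):
        # Full batch: use every training point (classic "steepest descent" on the empirical loss).
        # g = grad(w, X, y)

        # Mini-batch: a random subset -- cheaper per step, noisier direction.
        idx = rng.choice(len(y), size=32, replace=False)
        g = grad(w, X[idx], y[idx])

        # Single sample ("pure" SGD): same as above with size=1.

        w -= lr * g

The noise people talk about is just the difference between the mini-batch gradient and the full-batch one; the step direction jitters around the true descent direction.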
Here [1] is an article describing the same material, written by the author himself.

[1]: http://ruder.io/optimizing-gradient-descent/index.html
Stupid Q: assuming "gradient descent" is roughly the same idea as the classical "steepest descent" optimization algorithm (???), why aren't deep learning researchers looking into more advanced algorithms from classical non-linear optimization theory, like, say, (preconditioned) conjugate gradient or quasi-Newton methods such as BFGS?
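Purely as illustration of what that would look like, here's a rough sketch using SciPy's L-BFGS-B as a stand-in for a quasi-Newton method, on a made-up tiny least-squares problem (everything here is hypothetical, just to show the interface):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # Hypothetical small problem: fit weights w so that X @ w ~ y.
    X = rng.normal(size=(200, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.05 * rng.normal(size=200)

    def loss_and_grad(w):
        # Full-batch loss and its exact gradient, as quasi-Newton methods expect.
        residual = X @ w - y
        loss = np.mean(residual ** 2)
        grad = 2.0 * X.T @ residual / len(y)
        return loss, grad

    result = minimize(loss_and_grad, x0=np.zeros(10), jac=True, method="L-BFGS-B")
    print(result.fun, result.nit)

Note that this relies on an exact full-batch gradient; as I understand it, with noisy mini-batch gradients the curvature estimates these methods build become much less reliable, which is part of why SGD variants dominate in deep learning practice.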
SGD > adaptive methods for generalization, according to this (Wilson et al., "The Marginal Value of Adaptive Gradient Methods in Machine Learning"):

https://people.eecs.berkeley.edu/~brecht/papers/17.WilEtAl.Ada.pdf
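For anyone unfamiliar with what's actually being compared there: in code the difference is just the choice of optimizer object, everything else stays the same. A minimal sketch (hypothetical tiny PyTorch model, invented for the example):

    import torch
    import torch.nn as nn

    # Hypothetical tiny model and data, just to show the optimizer swap.
    model = nn.Linear(10, 1)
    X = torch.randn(256, 10)
    y = torch.randn(256, 1)
    loss_fn = nn.MSELoss()

    # "Plain" SGD (optionally with momentum)...
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # ...versus an adaptive method; this one line is the difference being compared.
    # optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()

The paper's argument is about which choice generalizes better on held-out data, not about any difference in how you use them.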