TL;DR: For high-dimensional models (say, with millions to billions of parameters), there's always a good set of parameters nearby, and when we start descending towards it, we are highly unlikely to get stuck, because almost always there's at least one path down along at least one of those dimensions -- i.e., there are no local optima. Once we've stumbled upon a good set of parameters, as measured by validation, we can stop.

These intuitions are consistent with my experience... but I think there's more to deep learning.

For instance, these intuitions fail to explain "weird" phenomena such as "double descent" and "interpolation thresholds" (there's a toy sketch of the latter below):

* https://openai.com/blog/deep-double-descent/

* https://arxiv.org/abs/1809.09349

* https://arxiv.org/abs/1812.11118

* See also: http://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf

We still don't fully understand why stochastic gradient descent works so well in so many domains.
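
To make the "interpolation threshold" idea a bit more concrete, here's a minimal toy sketch (my own setup, not taken from the linked papers): minimum-norm least squares on random ReLU features, sweeping the number of features past the number of training points. All names and parameter values here are assumptions for illustration; the usual double-descent plots in those papers are far more careful.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.5 * rng.normal(size=n)   # linear target + noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for p in [5, 10, 20, 35, 40, 45, 80, 200, 800]:       # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)           # fixed random projection
    F_tr = np.maximum(X_tr @ W, 0)                     # random ReLU features
    F_te = np.maximum(X_te @ W, 0)
    # lstsq returns the minimum-norm solution once p > n_train (interpolation regime)
    beta, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    test_mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p={p:4d}  test MSE={test_mse:.3f}")        # error tends to peak near p ≈ n_train
```

In this kind of toy run, test error typically rises as p approaches n_train, spikes around the interpolation threshold, and then comes back down as the model becomes heavily overparameterized -- which is the part the "no local optima" intuition alone doesn't account for.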