TechEcho

Why deep learning works even though it shouldn’t

356 points by r4um over 4 years ago

23 comments

brundolf over 4 years ago
Setting aside the primary subject, this is an excellent observation:

> What I find however is that there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can't be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.
blackbear_ over 4 years ago
There is another reason why training deep neural networks is not as difficult as it sounds: the landscape of the loss function seems to be made of broad "U"-shaped valleys that gently descend towards a small loss region. At initialization, the network is likely close to such a valley, and once it gets there the rest of training is just a leisurely stroll.

Formally, people have studied the spectrum of the Hessian and found that most of its eigenvalues are quite small, with only a few much larger ones. It all started with [1], with several recent extensions.

[1] https://arxiv.org/pdf/1611.07476.pdf
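The "few large eigenvalues, many near zero" picture can be seen in a toy case (my own sketch, not the experiment from the linked paper): for a linear model with squared loss, the Hessian is X^T X, so with more parameters than data points its rank is bounded by the number of samples and most eigenvalues are exactly zero.

```python
import numpy as np

# Toy illustration of the Hessian spectrum described above: a linear model
# with 20 parameters fit to 5 data points. The Hessian of 0.5 * ||X w - y||^2
# is X^T X, whose rank is at most n_samples, so at most 5 eigenvalues can be
# large and the remaining 15 are (numerically) zero.
rng = np.random.default_rng(0)
n_samples, n_params = 5, 20
X = rng.normal(size=(n_samples, n_params))

hessian = X.T @ X
eigvals = np.linalg.eigvalsh(hessian)  # sorted ascending

n_large = int(np.sum(eigvals > 1e-8))
print(n_large)  # number of non-negligible eigenvalues
```

Deep network Hessians are of course not this degenerate, but the empirical spectra in the paper have a qualitatively similar shape: a small "bulk" near zero plus a handful of outliers.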
maxwells-daemon over 4 years ago
I don't think we can say for sure that early stopping is the main reason deep networks generalize. Double descent [1] shows that models continue to improve even once they've "interpolated" the training data (fit every point perfectly), and critical periods [2] suggest that the early part of training is responsible for most of the generalization performance even though much of the numerical improvement happens later.

Overall it looks like gradient descent is a strong regularizer -- we know it tends to prefer small and low-variance weights, for example. So part of deep generalization has to do with how SGD is able to pick "good" features early, and then optimization pushes the unimportant weights to zero later (hence lottery tickets).

[1] https://openai.com/blog/deep-double-descent/ and other papers. [2] https://arxiv.org/abs/1711.08856 and others.
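Double descent shows up even in a toy linear setting (a hedged sketch of the phenomenon, not the OpenAI experiment): fit minimum-norm least squares using the first p of the available features, and the test error typically spikes near the interpolation threshold p == n_train before falling again.

```python
import numpy as np

# Toy double-descent curve: min-norm least squares on the first p of 60
# features, averaged over trials. Near p == n_train the fit interpolates the
# noisy training labels and test error blows up; with more features it
# recovers. All sizes here are illustrative choices.
rng = np.random.default_rng(0)
n_features, n_train, n_test, n_trials = 60, 50, 500, 20

def avg_test_error(p):
    errs = []
    for _ in range(n_trials):
        w = rng.normal(size=n_features) / np.sqrt(n_features)  # true weights
        X = rng.normal(size=(n_train, n_features))
        y = X @ w + 0.1 * rng.normal(size=n_train)
        X_test = rng.normal(size=(n_test, n_features))
        y_test = X_test @ w
        w_hat = np.linalg.pinv(X[:, :p]) @ y  # minimum-norm solution
        errs.append(np.mean((X_test[:, :p] @ w_hat - y_test) ** 2))
    return float(np.mean(errs))

errors = {p: avg_test_error(p) for p in (5, 25, 50, 60)}
```

The peak at p == n_train is the "interpolation threshold" mentioned in [1]; past it, the minimum-norm solution gets smoother again.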
RandyRanderson over 4 years ago
Certainly a much shorter way to say this: if you have enough lines you can approximate any curve within a margin. This is what large neural networks are doing.

Another way to look at it: most neural nets are just a bunch of polynomials stitched together. You can see this from the popularity of the ReLU activation function. When the ReLU is negative, that poly is always zero in that area. When positive, it's some poly multiplied by a const -- another poly.

For nets that use other activation fns, they try to be linear in the area of most active input. So again they approximate a const * a poly.
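The "enough lines approximate any curve" point is easy to check numerically. A minimal sketch (my own toy, with evenly spaced breakpoints chosen for simplicity): fit sin(x) with a linear combination of shifted ReLUs, i.e. a one-hidden-layer ReLU network whose output weights are solved by least squares.

```python
import numpy as np

# Approximate sin(x) on [0, 2*pi] with a sum of shifted ReLUs. Each ReLU
# contributes one "kink", so the fit is piecewise linear; more units means
# shorter line segments and a smaller worst-case error.
x = np.linspace(0.0, 2.0 * np.pi, 400)
target = np.sin(x)

def relu_fit_error(n_units):
    breakpoints = np.linspace(0.0, 2.0 * np.pi, n_units, endpoint=False)
    features = np.maximum(0.0, x[:, None] - breakpoints[None, :])  # ReLU basis
    features = np.column_stack([np.ones_like(x), features])        # bias term
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.max(np.abs(features @ coef - target)))

err_small, err_large = relu_fit_error(4), relu_fit_error(32)
```

With 32 units the worst-case error is already tiny; trained networks additionally learn where to put the kinks instead of spacing them evenly.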
Animats over 4 years ago
Now that's fascinating.

I'd thought of machine learning as a form of optimization. Things like support vector machines really were hill climbing for some kind of local optimum point. But at a billion dimensions, you're doing something else entirely. I once went through Andrew Ng's old machine learning course on video, and he was definitely doing optimization.

The last time I actually had to do numerical optimization using gradients, I was trying to solve nonlinear differential equations for a physics engine in about a 20-dimensional space of joint angles that was very "stiff". That is, some dimensions might be many orders of magnitude steeper than others. It's like walking on a narrow zigzagging mountain ridge without falling off.

So deep learning is not at all like either of those. Hm.
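That "stiff" behavior can be reproduced in two dimensions (a toy sketch, with numbers chosen only for illustration): run gradient descent on f(x, y) = x^2 + 100*y^2, where the step size must be small enough for the steep y direction, leaving the shallow x direction to crawl.

```python
import numpy as np

# Gradient descent on the ill-conditioned quadratic f(x, y) = x^2 + 100*y^2.
# The step size 0.009 is near the stability limit for the steep y direction
# (which oscillates toward 0), while x shrinks by only ~1.8% per step.
lr, steps = 0.009, 100
pos = np.array([1.0, 1.0])
for _ in range(steps):
    grad = np.array([2.0 * pos[0], 200.0 * pos[1]])
    pos -= lr * grad
# After 100 steps: y is essentially zero, x has barely moved.
```

This is the "narrow zigzagging ridge": progress is gated by the steepest direction, which is why preconditioning and adaptive methods matter so much in stiff problems.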
Schiphol over 4 years ago
I loved this post, thanks for writing it. I get the argument why one shouldn't expect local minima in very high dimensions. But then, what's wrong with the informal argument that there has to be a minimum because, well, the expected loss cannot be negative?
21eleven over 4 years ago
I liked that part at the beginning where the author made clear they were going to discuss intuitions that, while they aren't proven, would be useful to make explicit for a more general audience. Good candor.
acadien over 4 years ago
Hey @moultano, in response to your argument about walls and nets not being in a minimum: it's my understanding that nets always live on high-dimensional saddle points, and that's commonly referred to in the literature. Even when you're optimizing, you're just moving towards ever lower-cost saddles that are closer to the optimum but almost never a local optimum (for the reasons spelled out in your post).
ZeljkoS over 4 years ago
It looks like the article was deleted ("Oops! That page can't be found"). Here is the Google Cache: https://webcache.googleusercontent.com/search?q=cache:5HMZ_ZO0mwkJ:https://moultano.wordpress.com/2020/10/18/why-deep-learning-works-even-though-it-shouldnt/+&cd=1&hl=en&ct=clnk&gl=hr
quicklime over 4 years ago
> High dimensional spaces are unlikely to have local optima, and probably don't have any optima at all.

Can someone who knows more about DL than I do help me understand this a little better?

The article uses the analogy of walls:

> Just recall what is necessary for a set of parameters to be at an optimum. All the gradients need to be zero, and the Hessian needs to be positive semidefinite. In other words, you need to be surrounded by walls. In 4 dimensions, you can walk through walls. GPT-3 has 175 billion parameters. In 175 billion dimensions, walls are so far beneath your notice that if you observe them at all it is like God looking down upon individual protons.

I'm struggling to understand what this really means in 4+ dimensions. But when I try to envision it going from 1 or 2 to 3 dimensions, it doesn't seem obvious at all that a 3D space should have fewer local optima than a 2D space.

In fact, having a "universal function" like a deep network seems like it should have more local optima. What am I missing?
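One crude way to build intuition for the "walls" quote (my own toy, using random symmetric matrices as a stand-in for Hessians at critical points, which is a strong simplifying assumption): sample such matrices and count how often every eigenvalue is positive, i.e. how often the point would be a local minimum rather than a saddle. The fraction collapses rapidly with dimension.

```python
import numpy as np

# For a critical point to be a local minimum, the Hessian must be positive
# semidefinite. Sample random symmetric (GOE-style) matrices and measure how
# often ALL eigenvalues are positive. Even going from 2 to 10 dimensions the
# probability falls from a sizable fraction to essentially zero.
rng = np.random.default_rng(0)

def frac_positive_definite(dim, trials=2000):
    hits = 0
    for _ in range(trials):
        A = rng.normal(size=(dim, dim))
        H = (A + A.T) / 2.0  # random symmetric matrix
        if np.all(np.linalg.eigvalsh(H) > 0.0):
            hits += 1
    return hits / trials

f2, f10 = frac_positive_definite(2), frac_positive_definite(10)
```

Real loss Hessians are not GOE matrices, but this is the shape of the argument: requiring billions of eigenvalues to simultaneously be positive is astronomically restrictive, so critical points are overwhelmingly saddles.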
cs702 over 4 years ago
TL;DR: For high-dimensional models (say, with millions to billions of parameters), there's always a good set of parameters nearby, and when we start descending towards it, we are highly unlikely to get stuck, because almost always there's at least one path down along at least one of all those dimensions -- i.e., there are no local optima. Once we've stumbled upon a good set of parameters, as measured by validation, we can stop.

These intuitions are consistent with my experience... but I think there's more to deep learning.

For instance, these intuitions fail to explain "weird" phenomena, such as "double descent" and "interpolation thresholds":

* https://openai.com/blog/deep-double-descent/
* https://arxiv.org/abs/1809.09349
* https://arxiv.org/abs/1812.11118
* See also: http://www.stat.cmu.edu/~ryantibs/papers/lsinter.pdf

We still don't fully understand why stochastic gradient descent works so well in so many domains.
KKKKkkkk1 over 4 years ago
But it doesn't. Researchers have been saying for several years now that computer vision is more accurate than human vision, and face recognition was one of the first problems "solved." And yet when the pandemic hit, Apple had to scramble to adjust its unlock mechanism in iOS 13.5 because Face ID cannot recognize people wearing masks [1]. Humans have no trouble identifying people wearing masks. We are now almost a year into the pandemic, iOS 14 has been released, Face ID has not been fixed, and now we hear that Apple is bringing Touch ID back [2].

So sure, you've developed a methodology that can overfit nicely not only the train data but even the test data. But it still fails miserably when you apply your model in the field.

[1] https://www.theverge.com/2020/5/20/21265019/apple-ios-13-5-out-now-unlock-iphone-face-mask-id-exposure-notification-covid-19

[2] https://appleinsider.com/articles/20/10/16/under-display-touch-id-on-an-iphone-is-still-coming-leaker-claims
unimpossible over 4 years ago
A lot of people talk about minima because that's the language we have for analyzing optimization techniques. Deep learning is still new enough that there is lots of low-hanging fruit to explore, including empirical approaches and applying existing theoretical tools to try to explain DNNs. The community is slowly moving towards developing new tools specifically for deep learning to properly analyze these networks and prove things (bounds, convergence, etc.) about them.
mrfusion over 4 years ago
This quote:

> there are a base of unspoken intuitions that underlie expert understanding of a field, that are never directly stated in the literature, because they can't be easily proved with the rigor that the literature demands. And as a result, the insights exist only in conversation and subtext, which make them inaccessible to the casual reader.

I'd love to hear these intuitions from every field. Anyone got some?
foxes over 4 years ago
There is a trick in physics I am reminded about. In infinite dimensions there is no way to have a Gaussian measure on just an infinite dimensional Hilbert space. It needs to be embedded inside a bigger infinite dimensional space and then you can have some relative measure.

So you do not look at just a Gaussian integral, you look at a quotient of Gaussian integrals.

Perhaps there is a similar idea. Perhaps there is some sort of renormalisation that would make neural networks work better. Even if your neural network is infinite dimensional it still makes sense to talk about some surface relatively.
YeGoblynQueenne over 4 years ago
I find the article's style unreadable. Could someone please say whether the author explains why deep learning shouldn't work?

There is a bit at the start about how people in statistical learning throw their hands up at deep learning etc., but none of that makes sense to me. Neural nets are an idea as old as AI -- even older, in fact. The need for deeper networks was well understood by the 1980s. There are well-known results about feedforward neural nets with arbitrary hidden units being universal function approximators. Why shouldn't deep learning work?
woopwoop over 4 years ago
It seems to me all of these arguments apply just as well when the "deep" network has only one hidden layer.
mikhailfranco over 4 years ago
Another interesting explanation for deep learning's success in the physical world:

*Why does deep and cheap learning work so well?*

Max Tegmark *et al.*

https://arxiv.org/abs/1608.08225
ummonk over 4 years ago
Great article, though I never understood why people would think deep learning shouldn't work.
hexo over 4 years ago
Blank page? :(
turingbook over 4 years ago
Is the article gone?
jostmey over 4 years ago
The author argues that deep learning has abandoned statistics. I could not disagree more! Too much of the field was concerned with detailed proofs and mathematical formalism that were somewhat disconnected from probability theory. Modern machine learning (or AI, or whatever) still has strong roots in probability and statistics. Loss functions are still based on concepts such as the log-likelihood function.

Formal proofs and mathematics are essential, but can become a distraction from the end goal. It is like playing chess by going after your opponent's pawns instead of their king. I would say modern machine learning has become tantamount to experimental physics, and this article is written from the perspective of a string theorist.
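The "loss functions are still log-likelihoods" point can be made concrete with a small check (a minimal sketch with made-up labels and predictions): binary cross-entropy is exactly the average negative log-likelihood of the observed labels under a Bernoulli model.

```python
import numpy as np

# Binary cross-entropy vs. Bernoulli negative log-likelihood: the two
# formulas below are algebraically identical, just written differently.
y = np.array([1, 0, 1, 1, 0], dtype=float)  # observed labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])     # model's predicted P(y = 1)

# Standard cross-entropy loss, averaged over examples
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same quantity via the likelihood of the data under the Bernoulli model
likelihood = np.prod(np.where(y == 1, p, 1 - p))
neg_log_lik = -np.log(likelihood) / len(y)
```

The same identity holds for softmax cross-entropy and the categorical likelihood, and for squared error and a Gaussian noise model.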
yters over 4 years ago
How do we define "works"?