The "wedge" part under "3. Mode Connectivity" has at least one obvious component: Neural networks tend to be invariant to permuting nodes (together with their connections) within a layer. Simply put, it doesn't matter in what order you number the K nodes of e.g. a fully connected layer, but that alone already means there are K! different solutions with exactly the same behavior. Equivalently, the loss landscape is symmetric to certain permutations of its dimensions.<p>This means that, at the very least, there are <i>many</i> global optima (well, unless all permutable weights end up with the same value, which is obviously not the case). The fact that different initializations/early training steps can end up in different but equivalent optima follows directly from this symmetry. But whether all their basins are connected, or whether there are just multiple equivalent basins, is much less clear. The "non-linear" connection stuff does seem to imply that they are all in some (high-dimensional, non-linear) valley.<p>To be clear, this is just me looking at these results from the "permutation" perspective above, because it leads to a few obvious conclusions. But I am not qualified to judge which of these results are more or less profound.
Here are some "animated" loss landscapes I made quite a long time ago:<p><a href="http://evolvingstuff.blogspot.com/2011/02/animated-fractal-fitness-landscapes.html" rel="nofollow">http://evolvingstuff.blogspot.com/2011/02/animated-fractal-f...</a><p>These are related to recurrent neural networks evolved to maximize fitness whilst wandering through a randomly generated maze and picking up food pellets (the advantage being to remember not to revisit where you have already been.)
>> This leads to one of the key questions of deep learning, currently: Why do neural networks prefer solutions that generalize to unseen data, rather than settling on solutions which simply memorize the training data without actually learning anything?<p>That's the researchers who prefer these solutions, not the networks. And that's how the networks find them: because the experimenters have access to the test data, and they keep tuning their networks' parameters until they fit not only the training data but also the _test_ data.<p>In that sense the testing data is not "unseen". The neural net doesn't "see" it during training, but the researchers do, and they can try to improve the network's performance on it, because they control everything about how the network is trained, when it stops training, etc.<p>It's nothing to do with loss functions, and the answers are not in the maths. It's good, old researcher bias, and it has to be controlled by other means, namely rigorous design _and description_ of experiments.
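For what it's worth, the selection effect being described is easy to simulate. The following toy sketch is my own construction, not from the article: the labels are pure noise, so no model can genuinely beat chance, yet repeatedly "tuning" by keeping whichever configuration scores best on the test set still drives the reported test accuracy upward, and only an untouched holdout set reveals the inflation.

```python
# Toy simulation of test-set reuse / researcher bias: labels are random, so
# true accuracy is 50%, but selecting the best of many "tuned" models by test
# accuracy makes the test score look better than chance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 50
X_train, X_test, X_holdout = (rng.normal(size=(n, d)) for _ in range(3))
y_train, y_test, y_holdout = (rng.integers(0, 2, size=n) for _ in range(3))

best_test_acc, best_model, best_features = -1.0, None, None
for trial in range(200):
    # "Tuning": a random feature subset and regularization strength,
    # selected purely by test-set accuracy.
    features = rng.choice(d, size=10, replace=False)
    model = LogisticRegression(C=10 ** rng.uniform(-3, 3), max_iter=1000)
    model.fit(X_train[:, features], y_train)
    acc = model.score(X_test[:, features], y_test)
    if acc > best_test_acc:
        best_test_acc, best_model, best_features = acc, model, features

print("best test accuracy after tuning:", best_test_acc)   # noticeably above 0.5
print("same model on untouched holdout:",
      best_model.score(X_holdout[:, best_features], y_holdout))  # back near 0.5
```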
There is also this very interesting 2017 paper:<p>Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes<p>Lei Wu, Zhanxing Zhu, Weinan E<p><a href="https://arxiv.org/abs/1706.10239" rel="nofollow">https://arxiv.org/abs/1706.10239</a><p>I think it was the first paper to study the volume of the basins of attraction of good global minima, and it used a data-poisoning scheme to show how frequent bad global minima are, even though they are typically not found by SGD on the original dataset without poisoning.
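As I understand the paper's construction, the idea is roughly: train on the clean data plus deliberately mislabeled extra points; if the network fits everything, it also sits at a (near-)global minimum of the clean training loss, yet it generalizes worse than a model trained on the clean data alone. A rough sketch of that kind of experiment, where the dataset, architecture, and hyperparameters are my own illustrative choices:

```python
# Sketch of a poisoning-style experiment: fit clean data alone vs. clean data
# plus randomly labeled "poison" points. Both fits typically reach (near-)zero
# error on the clean training set, but the poisoned one lands in a different
# minimum with worse test accuracy.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train, y_train = make_moons(n_samples=200, noise=0.1, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.1, random_state=1)

# Poison set: extra points scattered over the input region, with random labels.
X_poison = rng.uniform(-1.5, 2.5, size=(200, 2))
y_poison = rng.integers(0, 2, size=200)

def fit(X, y):
    clf = MLPClassifier(hidden_layer_sizes=(256, 256), alpha=0.0,
                        max_iter=2000, tol=1e-6, random_state=0)
    return clf.fit(X, y)

clean = fit(X_train, y_train)
poisoned = fit(np.vstack([X_train, X_poison]), np.hstack([y_train, y_poison]))

for name, m in [("clean", clean), ("poisoned", poisoned)]:
    print(name, "train acc:", m.score(X_train, y_train),
          "test acc:", m.score(X_test, y_test))
```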
The loss function is on the parameter space, and "wide basins" having better generalization is equivalent to saying that regularization (in whatever form) gives better generalization, since regularization constrains the parameter and/or function space in that way.<p>In small (two or three) dimensions, there are ways of visualizing overtraining/regularization/generalization with scatter plots (perhaps coloured by output label) of the activations in each layer. Training forms tighter "modes" in the activations, and the "low density" space between modes constitutes "undefined input space" for subsequent layers. Overtraining is when real data falls in these "dead" regions. The aim of regularization is to shape the activation distributions so that unseen data falls somewhere with non-zero density.<p>Training loss does not give any information about generalization here unless it shows you're in a narrow "well". The loss landscapes are high-dimensional and non-obvious to reason about, even in tiny examples.
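A minimal sketch of that kind of per-layer activation plot. The toy dataset, the 2-unit tanh layers, and colouring by true label are my own illustrative assumptions, chosen so each layer's activations are directly 2D and can be scattered without projection:

```python
# Train a tiny MLP on a 2D toy problem, then scatter-plot each hidden layer's
# 2D activations coloured by class. Tight per-class clusters ("modes") tend to
# form; the sparse space between them is "undefined input" for later layers.
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X_np, y_np = make_moons(n_samples=500, noise=0.15, random_state=0)
X = torch.tensor(X_np, dtype=torch.float32)
y = torch.tensor(y_np)

layers = nn.ModuleList([nn.Linear(2, 2), nn.Linear(2, 2), nn.Linear(2, 2)])
head = nn.Linear(2, 2)
opt = torch.optim.Adam(list(layers.parameters()) + list(head.parameters()), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def forward(x, collect=False):
    acts = []
    for layer in layers:
        x = torch.tanh(layer(x))
        if collect:
            acts.append(x.detach())
    return head(x), acts

for step in range(2000):
    opt.zero_grad()
    logits, _ = forward(X)
    loss_fn(logits, y).backward()
    opt.step()

_, acts = forward(X, collect=True)
fig, axes = plt.subplots(1, len(acts), figsize=(4 * len(acts), 4))
for i, (ax, a) in enumerate(zip(axes, acts)):
    ax.scatter(a[:, 0], a[:, 1], c=y_np, s=8)
    ax.set_title(f"layer {i + 1} activations")
plt.show()
```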
This is why model simplicity is so important. When a model has fewer parameters, it is forced to use those weights to capture the most broadly applicable patterns in the training data, rather than the noise.<p>"Why might SGD prefer basins that are flatter?"
It's because SGD follows the derivative. When the bottom of the valley is wide and flat, the gradients there are small, so the iterates don't get enough of a push to climb back out; a sharp, narrow basin is much easier to bounce out of.<p>I have observed the lottery ticket hypothesis.
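A 1D toy construction of my own (not from the article) that matches this picture: a loss with one sharp minimum and one wide, flat minimum, optimized by gradient descent with added noise standing in for minibatch noise. Runs started at the sharp minimum typically get thrown out, while runs started at the wide minimum stay near its bottom.

```python
# Noisy gradient descent started exactly at a sharp minimum (x = +2) and at a
# wide, flat minimum (x = -2). The sharp basin cannot hold the iterates with
# this step size and noise level; the wide basin can.
import numpy as np

def loss(x):
    sharp = 0.5 * np.exp(-((x - 2.0) ** 2) / 0.01)  # narrow well at x = +2
    wide = 1.0 * np.exp(-((x + 2.0) ** 2) / 2.0)    # wide well at x = -2
    return 1.0 - sharp - wide

def grad(x, eps=1e-5):
    return (loss(x + eps) - loss(x - eps)) / (2 * eps)  # numerical derivative

rng = np.random.default_rng(0)
lr, noise_std, steps = 0.05, 0.3, 1000

for start in (+2.0, -2.0):
    finals = []
    for trial in range(100):
        x = start
        for _ in range(steps):
            x -= lr * (grad(x) + rng.normal(0.0, noise_std))
        finals.append(loss(x))
    # Runs from the sharp well typically end far above its bottom value;
    # runs from the wide well typically stay near theirs.
    print(f"started at x={start:+.0f}: loss at start {loss(start):.3f}, "
          f"mean loss after noisy descent {np.mean(finals):.3f}")
```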