Understanding deep learning requires rethinking generalization

165 points by tmfi, about 4 years ago

7 comments

magicalhippo, about 4 years ago
"The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data."

So they're fitting elephants [1].

I've been trying to use DeepSpeech [2] lately for a project; it would be interesting to see the results for that.

I guess it could also be a decent test for your model: retrain it with random labels, and if it still succeeds, the model is just memorizing, so either reduce model complexity or add more training data.

[1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/

[2]: https://github.com/mozilla/DeepSpeech
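A minimal sketch of that random-label check (the tiny MLP and synthetic data below are stand-ins for whatever model and dataset you actually use, e.g. DeepSpeech):

    # Randomization test: can the model drive training loss to ~zero
    # even when the labels are shuffled? If so, it has enough capacity
    # to memorize the training set.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    n, dim, classes = 1024, 256, 10
    x = torch.randn(n, dim)                   # stand-in inputs
    real = torch.randint(0, classes, (n,))    # stand-in "real" labels
    shuffled = real[torch.randperm(n)]        # labels randomly permuted

    def final_loss(labels, epochs=500):
        model = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                              nn.Linear(512, classes))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(x), labels)
            loss.backward()
            opt.step()
        return loss.item()

    print("loss, real labels:    ", final_loss(real))
    print("loss, shuffled labels:", final_loss(shuffled))

If both runs reach near-zero loss, reducing capacity or adding data (as suggested above) is one way to push the network away from pure memorization.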
blt, about 4 years ago
Please add the (still) to the HN post title. The original version of the paper without (still) in the title is several years old.
sdenton4, about 4 years ago
I tend to think this is a result of classification's information density being too low. You can 'learn' a classification problem in the same way with a hash function: take a hash of each image, and memorize the hash and label. Then you only need to 'learn' a very tiny amount of data relative to the size of the dataset to get zero loss. A series of random projections can also function as a crude hash function, and this is likely how NN memorization works.

Generative models, on the other hand, don't allow this kind of data-reduction trick. If you need to predict (say) every sample of an audio stream conditioned on the previous audio samples, you really do need to memorize the whole dataset (not just a hash of each item) to get to zero loss, because the information density of the output is still very high.

And then you've got BERT, which is basically a generative model for language, used for downstream tasks. The information-dense task probably helps with the memorization 'problem', so good features are learned that adapt nicely to other tasks. (And as others have said, memorization may not be a problem in practice. Sometimes it's actually the right thing to do.)
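A toy version of that hash-and-memorize trick (the data and names here are made up for illustration): a lookup table from image hash to label reaches zero training error while storing almost nothing relative to the raw data, and of course generalizes to nothing.

    # "Learn" a classification dataset by memorizing (hash -> label) pairs.
    import hashlib

    def image_hash(image_bytes: bytes) -> str:
        # Any deterministic hash works; random projections are a crude analogue.
        return hashlib.sha256(image_bytes).hexdigest()

    # Stand-in dataset: raw image bytes paired with integer class labels.
    dataset = [(b"image-0-raw-bytes", 3), (b"image-1-raw-bytes", 7)]

    # "Training": store a tiny lookup table instead of the images themselves.
    table = {image_hash(img): label for img, label in dataset}

    # "Inference": perfect on the training set, useless on unseen inputs.
    def predict(image_bytes: bytes, default: int = -1) -> int:
        return table.get(image_hash(image_bytes), default)

    assert all(predict(img) == label for img, label in dataset)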
benlivengood, about 4 years ago
This is only tangentially my field, so pure speculation.

I suppose it's possible that generalized minima are numerically more common than overfitted minima in an over-parameterized model, so probabilistically SGD will find a more general minimum than not, regardless of regularization.
anonymousDan, about 4 years ago
I would be interested to know if there is any effect on the time to convergence between the two setups (i.e. training with real labels vs. random labels). Is it in any way easier/harder to memorize everything vs. extracting a generalizable representation?

Edit: Ah, I see they mention they not only investigated non-convergence but also whether training slows down. So randomizing/corrupting the labels does impact the convergence rate. This means it is more work/effort to memorize things, which I guess is interesting.
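One rough way to measure that slowdown yourself (a sketch only; the small bundled scikit-learn digits dataset and MLP below stand in for the paper's ImageNet/CIFAR-scale experiments) is to count the epochs needed to push training loss below a threshold with real versus shuffled labels:

    # Epochs-to-memorization with real vs. shuffled labels.
    import torch
    import torch.nn as nn
    from sklearn.datasets import load_digits

    torch.manual_seed(0)
    digits = load_digits()
    x = torch.tensor(digits.data, dtype=torch.float32) / 16.0  # 1797 x 64
    real = torch.tensor(digits.target, dtype=torch.long)
    shuffled = real[torch.randperm(len(real))]

    def epochs_to_fit(labels, threshold=0.01, max_epochs=5000):
        model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                              nn.Linear(256, 10))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(1, max_epochs + 1):
            opt.zero_grad()
            loss = loss_fn(model(x), labels)
            loss.backward()
            opt.step()
            if loss.item() < threshold:
                return epoch
        return max_epochs

    # Both runs should eventually fit, but the shuffled-label run
    # typically needs noticeably more epochs.
    print("epochs, real labels:    ", epochs_to_fit(real))
    print("epochs, shuffled labels:", epochs_to_fit(shuffled))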
clircle, about 4 years ago
> Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.

I'd say it's more about the simplicity of the task and the quality of the data.
vonsydov, about 4 years ago
The whole point of neural networks was that you don&#x27;t need to think hard about generalizations.