> The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data.

So they're fitting elephants [1].

I've been trying to use DeepSpeech [2] lately for a project; it would be interesting to see the results for that.

I guess it could also be a decent sanity check for your own model: retrain it with random labels, and if it still fits them the model has enough capacity to simply memorize, so either reduce model complexity or add more training data. (Sketch of what I mean below the links.)

[1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/

[2]: https://github.com/mozilla/DeepSpeech
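Roughly what I have in mind, as a PyTorch sketch rather than anything from the paper; `build_model`, `X` and `y` are placeholders for whatever model and training set you actually have:

    # Sanity check: can the model reach high *training* accuracy on shuffled labels?
    import torch
    import torch.nn as nn

    def fits_random_labels(build_model, X, y, epochs=200, lr=1e-3):
        """Train on permuted labels; near-perfect train accuracy => enough capacity to memorize."""
        y_random = y[torch.randperm(len(y))]       # destroy any real input/label relationship
        model = build_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(X), y_random)
            loss.backward()
            opt.step()
        train_acc = (model(X).argmax(dim=1) == y_random).float().mean().item()
        return train_acc                            # ~1.0 means the model can memorize this dataset

It only tells you the capacity is there, of course, not that the model trained on real labels is actually memorizing.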
I tend to think this is a result of classification's information density being too low. You can 'learn' a classification problem the same way with a hash function: take a hash of each image and memorize the (hash, label) pair. You then only need to 'learn' a tiny amount of data relative to the size of the dataset to get zero loss. A series of random projections can also act as a crude hash function, and this is likely how NN memorization works. (Toy version of the trick below.)

Generative models, on the other hand, don't allow this kind of data-reduction trick. If you need to predict (say) every sample of an audio stream conditioned on the previous samples, you really do need to memorize the whole dataset (not just a hash of each item) to get to zero loss, because the information density of the output is still very high.

And then you've got BERT, which is basically a generative model for language used for downstream tasks. The information-dense task probably helps avoid the memorization 'problem', so good features are learned that adapt nicely to other tasks. (And as others have said, memorization may not be a problem in practice; sometimes it's actually the right thing to do.)
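The hash trick spelled out; the fake images and random labels here are made up purely to keep it self-contained:

    # A "classifier" that gets zero training error by memorizing a hash of each input.
    import hashlib
    import numpy as np

    class HashMemorizer:
        def __init__(self):
            self.table = {}

        @staticmethod
        def _key(x):
            return hashlib.sha1(np.ascontiguousarray(x).tobytes()).hexdigest()

        def fit(self, X, y):
            for x, label in zip(X, y):
                self.table[self._key(x)] = label   # a few bytes per example, not the image itself
            return self

        def predict(self, X):
            # Perfect on training data, arbitrary (here: class 0) on anything unseen.
            return np.array([self.table.get(self._key(x), 0) for x in X])

    X = np.random.rand(1000, 32, 32, 3)      # fake "images"
    y = np.random.randint(0, 10, size=1000)  # random labels -- doesn't matter, still zero train error
    clf = HashMemorizer().fit(X, y)
    print((clf.predict(X) == y).mean())      # 1.0 on the training set, chance level elsewhere

Zero training loss, no generalization, and the stored state is tiny compared to the dataset, which is the kind of compression a stack of random projections plus a lookup can approximate.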
This is only tangentially my field, so pure speculation.

I suppose it's possible that generalized minima are numerically more common than overfitted minima in an over-parameterized model, so probabilistically SGD will find a more general minimum than not, regardless of regularization.
I would be interested to know if there is any effect on the time to convergence between the two setups (i.e. training with real labels vs. random labels). Is it in any way easier or harder to memorize everything vs. extracting a generalizable representation?

Edit: Ah, I see they mention they not only investigated non-convergence but also whether training slows down. So randomising/corrupting the labels does impact the convergence rate, which means it is more work/effort to memorize things. I guess that's interesting. (Rough way to see the effect below.)
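A crude way to see it for yourself, on made-up data (synthetic inputs, a planted linear labelling rule, an arbitrary small MLP; none of this is from the paper):

    # How many epochs to (almost) interpolate the training set as more labels are corrupted?
    import torch
    import torch.nn as nn

    def epochs_to_fit(corrupt_frac, n=2000, d=64, classes=10, max_epochs=5000):
        X = torch.randn(n, d)
        y = (X @ torch.randn(d, classes)).argmax(dim=1)    # 'real' labels: a learnable rule
        n_corrupt = int(corrupt_frac * n)
        idx = torch.randperm(n)[:n_corrupt]
        y[idx] = torch.randint(0, classes, (n_corrupt,))   # corrupt a fraction of them
        model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(1, max_epochs + 1):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
            if (model(X).argmax(dim=1) == y).float().mean() >= 0.99:
                return epoch                               # epochs needed to fit/memorize
        return max_epochs

    for frac in (0.0, 0.5, 1.0):
        print(frac, epochs_to_fit(frac))                   # expect the count to grow with corruption

Toy-scale, obviously, but it's the same question the paper asks with real architectures and datasets.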
> Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.

I'd say it's more about the simplicity of the task and the quality of the data.