> The experiments we conducted emphasize that the effective capacity of several successful neural network architectures is large enough to shatter the training data. Consequently, these models are in principle rich enough to memorize the training data.

So they're fitting elephants [1].

I've been trying to use DeepSpeech [2] lately for a project; it would be interesting to see the results for that.

I guess it could also be a decent sanity check for your own model: retrain it with random labels, and if it still fits them the model has enough capacity to simply memorize, so either reduce model complexity or add more training data. (Sketch of what I mean below the links.)

[1]: https://www.johndcook.com/blog/2011/06/21/how-to-fit-an-elephant/

[2]: https://github.com/mozilla/DeepSpeech
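Roughly what I have in mind, as a PyTorch sketch rather than anything from the paper; `build_model`, `X` and `y` are placeholders for whatever model and training set you actually have:

    # Sanity check: can the model reach high *training* accuracy on shuffled labels?
    import torch
    import torch.nn as nn

    def fits_random_labels(build_model, X, y, epochs=200, lr=1e-3):
        """Train on permuted labels; near-perfect train accuracy => enough capacity to memorize."""
        y_random = y[torch.randperm(len(y))]       # destroy any real input/label relationship
        model = build_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(model(X), y_random)
            loss.backward()
            opt.step()
        train_acc = (model(X).argmax(dim=1) == y_random).float().mean().item()
        return train_acc                            # ~1.0 means the model can memorize this dataset

It only tells you the capacity is there, of course, not that the model trained on real labels is actually memorizing.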
I tend to think this is a result of classification's information density being too low. You can 'learn' a classification problem the same way with a hash function: take a hash of each image and memorize the (hash, label) pair. You then only need to 'learn' a tiny amount of data relative to the size of the dataset to get zero loss. A series of random projections can also act as a crude hash function, and this is likely how NN memorization works. (Toy version of the trick below.)

Generative models, on the other hand, don't allow this kind of data-reduction trick. If you need to predict (say) every sample of an audio stream conditioned on the previous samples, you really do need to memorize the whole dataset (not just a hash of each item) to get to zero loss, because the information density of the output is still very high.

And then you've got BERT, which is basically a generative model for language used for downstream tasks. The information-dense task probably helps avoid the memorization 'problem', so good features are learned that adapt nicely to other tasks. (And as others have said, memorization may not be a problem in practice; sometimes it's actually the right thing to do.)
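The hash trick spelled out; the fake images and random labels here are made up purely to keep it self-contained:

    # A "classifier" that gets zero training error by memorizing a hash of each input.
    import hashlib
    import numpy as np

    class HashMemorizer:
        def __init__(self):
            self.table = {}

        @staticmethod
        def _key(x):
            return hashlib.sha1(np.ascontiguousarray(x).tobytes()).hexdigest()

        def fit(self, X, y):
            for x, label in zip(X, y):
                self.table[self._key(x)] = label   # a few bytes per example, not the image itself
            return self

        def predict(self, X):
            # Perfect on training data, arbitrary (here: class 0) on anything unseen.
            return np.array([self.table.get(self._key(x), 0) for x in X])

    X = np.random.rand(1000, 32, 32, 3)      # fake "images"
    y = np.random.randint(0, 10, size=1000)  # random labels -- doesn't matter, still zero train error
    clf = HashMemorizer().fit(X, y)
    print((clf.predict(X) == y).mean())      # 1.0 on the training set, chance level elsewhere

Zero training loss, no generalization, and the stored state is tiny compared to the dataset, which is the kind of compression a stack of random projections plus a lookup can approximate.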
This is only tangentially my field, so pure speculation.

I suppose it's possible that generalized minima are numerically more common than overfitted minima in an over-parameterized model, so probabilistically SGD will find a more general minimum than not, regardless of regularization.
I would be interested to know if there is any effect on the time to convergence between the two setups (i.e. training with real labels vs. random labels). Is it in any way easier or harder to memorize everything vs. extracting a generalizable representation?

Edit: Ah, I see they mention they not only investigated non-convergence but also whether training slows down. So randomising/corrupting the labels does impact the convergence rate, which means it is more work/effort to memorize things. I guess that's interesting. (Rough way to see the effect below.)
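A crude way to see it for yourself, on made-up data (synthetic inputs, a planted linear labelling rule, an arbitrary small MLP; none of this is from the paper):

    # How many epochs to (almost) interpolate the training set as more labels are corrupted?
    import torch
    import torch.nn as nn

    def epochs_to_fit(corrupt_frac, n=2000, d=64, classes=10, max_epochs=5000):
        X = torch.randn(n, d)
        y = (X @ torch.randn(d, classes)).argmax(dim=1)    # 'real' labels: a learnable rule
        n_corrupt = int(corrupt_frac * n)
        idx = torch.randperm(n)[:n_corrupt]
        y[idx] = torch.randint(0, classes, (n_corrupt,))   # corrupt a fraction of them
        model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for epoch in range(1, max_epochs + 1):
            opt.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            opt.step()
            if (model(X).argmax(dim=1) == y).float().mean() >= 0.99:
                return epoch                               # epochs needed to fit/memorize
        return max_epochs

    for frac in (0.0, 0.5, 1.0):
        print(frac, epochs_to_fit(frac))                   # expect the count to grow with corruption

Toy-scale, obviously, but it's the same question the paper asks with real architectures and datasets.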
> Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training.

I'd say it's more about the simplicity of the task and the quality of the data.