The article leaves "generalisation" without further explanation and, as usual,
there's a lot of confusion about what, exactly, is meant by the claim that
overparameterised deep neural networks "generalize astoundingly well to new
data".<p>So, what is meant is that such networks (let's call them complex networks for
short) generalise well <i>to test data</i>. But what exactly is "test data"? Here's
how it works.<p>A team has some data, let's call it D. The team partitions the data into a
training set and a test set, T₁ and T₂. The team trains their system on T₁ and tests
it on T₂. To choose the model to test on T₂, the team may perform
cross-validation, which further partitions T₁ into k validation partitions, T₁₁,
..., T₁ₖ. In the most common cross-validation scheme, k-fold cross-validation,
the team trains on k-1 validation partitions of T₁ and tests on the remaining
one, trying all k combinations of k-1 training partitions and 1 testing
partition _of T₁_ (because we're still in validation, which can be very
confusing). At the end of this process the team have k models each of which is
trained on k-1 different partitions, and tested on one of k different
partitions, of the training set T₁. The team chooses the model with the best
accuracy (or whatever metric they're measuring) and then they test this
model on T₂, the testing partition. Finally, the team report the accuracy (etc)
of their trained model on the testing partition, T₂, and basically claim (though
generally the claim is implicit) that the accuracy of their best model on T₂ is
a more or less accurate estimate of the accuracy of the model on unseen data, in
the real world, i.e. data not included in D, either as T₁ or T₂. Such truly
unseen data (it was not available _to the team_ at training time) is sometimes
referred to as "out of distribution" data, and let's call it that for simplicity
[1].
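<p>To make the regime concrete, here's a rough sketch in Python. This is my own illustration, not anything from the article: I'm assuming scikit-learn, a toy dataset standing in for D, a small MLP standing in for the complex network, k = 5 and an 80/20 split.

    # Hypothetical sketch of the train/test split + k-fold cross-validation regime.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Stand-in for the team's dataset D.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Partition D into the training set T1 (80%) and the test set T2 (20%).
    X_t1, X_t2, y_t1, y_t2 = train_test_split(X, y, test_size=0.2, random_state=0)

    # k-fold cross-validation *within T1*: k candidate models, each trained on
    # k-1 folds of T1 and validated on the remaining fold.
    k = 5
    candidates = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_t1):
        model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
        model.fit(X_t1[train_idx], y_t1[train_idx])
        val_acc = accuracy_score(y_t1[val_idx], model.predict(X_t1[val_idx]))
        candidates.append((val_acc, model))

    # Choose the candidate with the best validation accuracy ...
    best_val_acc, best_model = max(candidates, key=lambda c: c[0])

    # ... and report its accuracy on T2 as the estimate of real-world performance.
    test_acc = accuracy_score(y_t2, best_model.predict(X_t2))
    print(f"best validation accuracy {best_val_acc:.3f}, reported test accuracy {test_acc:.3f}")

<p>The number that leaves the lab is test_acc, computed once, at the very end, on T₂.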
<p>So, with this background at hand, we now understand that when the article above
(and neural net researchers) say that complex networks "generalise well", they
mean "on test data". Which is only mildly surprising, given that the usual
assumptions (test data drawn from the same distribution as the training data,
etc. etc.) already make it unsurprising that they'd do well on data that looks a
lot like their training data.<p>There are two key observations to make here.<p>One is that complex networks generalise well on test data _when the researchers
have access to the test data_. When researchers have access to test data, the
training regime I outline above includes a further step: the team look at the
accuracy of their system on test data, find it to be abysmal, and, crestfallen,
abandon their expensive research and start completely from scratch, because
obviously they wouldn't just try to tweak their system to perform better on the
test data! That would be essentially peeking at the test data and guiding their
system to perform well on it (e.g. by tuning hyperparameters or by better
"random" initialisation)!<p>I'm kidding. <i>Of course</i> a team who finds their system performs badly on test
data will tweak their system to perform better on the test data. They'll peek.
And peek again. Not only will they peek, they'll conduct an automated
hyperparameter-tuning search (a "grid search") _on the test data_! That is
what's known in science, I believe, as a "fishing expedition".
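<p>Sketched in code, the fishing expedition looks something like this. Again, this is hypothetical and assumes scikit-learn; the particular grid of widths, learning rates and "random" seeds is made up. The detail that matters is where the score comes from.

    # Hypothetical sketch of a grid search scored directly on the test set T2.
    from itertools import product

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Stand-in for D, split 80/20 into T1 and T2 as before.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_t1, X_t2, y_t1, y_t2 = train_test_split(X, y, test_size=0.2, random_state=0)

    best_acc, best_params = -1.0, None
    # Grid search over hyperparameters and "random" seeds -- the peeking is that
    # every candidate is scored on T2 itself.
    for width, lr, seed in product((50, 100, 200), (1e-3, 1e-2), range(3)):
        model = MLPClassifier(hidden_layer_sizes=(width,), learning_rate_init=lr,
                              random_state=seed, max_iter=500)
        model.fit(X_t1, y_t1)
        acc = accuracy_score(y_t2, model.predict(X_t2))  # scored on the test set
        if acc > best_acc:
            best_acc, best_params = acc, (width, lr, seed)

    # The number that ends up in the paper: the best score found *on* T2.
    print(f"reported 'test' accuracy {best_acc:.3f} with {best_params}")

<p>Once T₂ has been used to pick the winner, it's no longer unseen data; it's just a second validation set.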
<p>Two, in the typical training regime, the training partition T₁ is 80% of D, the
entire dataset, and T₂ is 20% of D. And when T₁ is partitioned for
cross-validation, the validation-training partition (the k-1 folds trained on)
is also usually 4 times larger than the validation-testing partition (the single
fold tested on, i.e. k = 5). Why? Because complex networks need Big Data to train on. But,
what happens when you train on 80% of your data and test on 20% of it? What
happens is that your estimates of accuracy are not very good estimates, because,
even if your data is Really Big, 20% of it is likely to miss a big chunk of the
variation found in the other 80%, and so you're basically only estimating
your system's accuracy on a small subset of the features that it needs to
represent to be said to "generalise well". For this reason, complex networks,
like all Big Data approaches, are pretty much crap at generalising to "OOD"
data, or, more to the point, estimates of their accuracy on OOD data are just
pretty bad. In practice, deep neural net systems that have amazing performance
"in the lab", can be expected to lose 20-40% of their performance "in the real
world" [2].<p>Bottom line, there's no big mystery why complex networks "generalise" so well
<p>Bottom line, there's no big mystery why complex networks "generalise" so well
and there is no need to seek an explanation for this in kernel machines. Complex
networks generalise so well because the kind of generalisation they're so good
at is the kind of generalisation achieved by humans tweaking the network until
it overfits _to the test data_. And that's the worst kind of overfitting.<p>__________<p>[1] Under PAC-Learning assumptions we can expect the performance of any machine
learning system on data that is truly "out of distribution", in the sense that
it is drawn from a distribution radically different from the distribution of the
training data, to be really bad, because we assume distributional consistency
between training and "true" data. But that's a bit of a quibble, and "OOD" has
established itself as a half-understood jargon term, so I'll let it rest.<p>[2] I had a reference for that somewhere... can't find it. You'll have to take
my word for it. Would I lie to you?