The article leaves "generalisation" without further explanation and, as usual,
there's a lot of confusion about what, exactly, is meant by the claim that
overparameterised deep neural networks "generalize astoundingly well to new
data".<p>So, what is meant is that such networks (let's call them complex networks for
short) generalise well <i>to test data</i>. But what exactly is "test data"? Here's
how it works.<p>A team has some data, let's call it D. The team partitions the data into a
training set and a test set, T₁ and T₂. The team trains their system on T₁ and tests
it on T₂. To choose the model to test on T₂, the team may perform
cross-validation, which further partitions T₁ into k validation partitions, T₁₁,
..., T₁ₖ. In the most common cross-validation scheme, k-fold cross-validation,
the team trains on k-1 validation partitions of T₁ and tests on the remaining
one, trying all k combinations of k-1 training partitions and 1 testing
partition _of T₁_ (because we're still in validation, which can be very
confusing). At the end of this process the team have k models each of which is
trained on k-1 different partitions, and tested on one of k different
partitions, of the training set T₁. The team chooses the model with the best
accuracy (or whatever metric they're measuring) and then they test this
model on T₂, the testing partition. Finally, the team report the accuracy (etc)
of their trained model on the testing partition, T₂, and basically claim (though
generally the claim is implicit) that the accuracy of their best model on T₂ is
a more or less accurate estimate of the accuracy of the model on unseen data, in
the real world, i.e. data not included in D, either as T₁ or T₂. Such truly
unseen data (it was not available _to the team_ at training time) is sometimes
referred to as "out of distribution" data, and let's call it that for simplicity
[1].
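<p>To make the regime concrete, here's a rough sketch in Python. This is my own illustration, not anything from the article: I'm assuming scikit-learn, a toy dataset standing in for D, a small MLP standing in for the complex network, k = 5 and an 80/20 split.

    # Hypothetical sketch of the train/test split + k-fold cross-validation regime.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, KFold
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Stand-in for the team's dataset D.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Partition D into the training set T1 (80%) and the test set T2 (20%).
    X_t1, X_t2, y_t1, y_t2 = train_test_split(X, y, test_size=0.2, random_state=0)

    # k-fold cross-validation *within T1*: k candidate models, each trained on
    # k-1 folds of T1 and validated on the remaining fold.
    k = 5
    candidates = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X_t1):
        model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
        model.fit(X_t1[train_idx], y_t1[train_idx])
        val_acc = accuracy_score(y_t1[val_idx], model.predict(X_t1[val_idx]))
        candidates.append((val_acc, model))

    # Choose the candidate with the best validation accuracy ...
    best_val_acc, best_model = max(candidates, key=lambda c: c[0])

    # ... and report its accuracy on T2 as the estimate of real-world performance.
    test_acc = accuracy_score(y_t2, best_model.predict(X_t2))
    print(f"best validation accuracy {best_val_acc:.3f}, reported test accuracy {test_acc:.3f}")

<p>The number that leaves the lab is test_acc, computed once, at the very end, on T₂.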
<p>So, with this background at hand, we now understand that when the article above
(and neural net researchers) say that complex networks "generalise well", they
mean "on test data". Which is only mildly surprising, given that the usual
assumptions (test data drawn from the same distribution as the training data,
etc. etc.) already make it unsurprising that they'd do well on data that looks a
lot like their training data.<p>There are two key observations to make here.<p>One is that complex networks generalise well on test data _when the researchers
have access to the test data_. When researchers have access to test data, the
training regime I outline above includes a further step: the team look at the
accuracy of their system on test data, find it to be abysmal, and, crestfallen,
abandon their expensive research and start completely from scratch, because
obviously they wouldn't just try to tweak their system to perform better on the
test data! That would be essentially peeking at the test data and guiding their
system to perform well on it (e.g. by tuning hyperparameters or by better
"random" initialisation)!<p>I'm kidding. <i>Of course</i> a team who finds their system performs badly on test
data will tweak their system to perform better on the test data. They'll peek.
And peek again. Not only will they peek, they'll conduct an automated
hyperparameter-tuning search (a "grid search") _on the test data_! That is
what's known in science, I believe, as a "fishing expedition".
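<p>Sketched in code, the fishing expedition looks something like this. Again, this is hypothetical and assumes scikit-learn; the particular grid of widths, learning rates and "random" seeds is made up. The detail that matters is where the score comes from.

    # Hypothetical sketch of a grid search scored directly on the test set T2.
    from itertools import product

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # Stand-in for D, split 80/20 into T1 and T2 as before.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_t1, X_t2, y_t1, y_t2 = train_test_split(X, y, test_size=0.2, random_state=0)

    best_acc, best_params = -1.0, None
    # Grid search over hyperparameters and "random" seeds -- the peeking is that
    # every candidate is scored on T2 itself.
    for width, lr, seed in product((50, 100, 200), (1e-3, 1e-2), range(3)):
        model = MLPClassifier(hidden_layer_sizes=(width,), learning_rate_init=lr,
                              random_state=seed, max_iter=500)
        model.fit(X_t1, y_t1)
        acc = accuracy_score(y_t2, model.predict(X_t2))  # scored on the test set
        if acc > best_acc:
            best_acc, best_params = acc, (width, lr, seed)

    # The number that ends up in the paper: the best score found *on* T2.
    print(f"reported 'test' accuracy {best_acc:.3f} with {best_params}")

<p>Once T₂ has been used to pick the winner, it's no longer unseen data; it's just a second validation set.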
<p>Two, in the typical training regime, the training partition T₁ is 80% of D, the
entire dataset, and T₂ is 20% of D. And when T₁ is partitioned for
cross-validation, the validation-training partition (the k-1 folds trained on)
is also usually 4 times larger than the validation-testing partition (the single
fold tested on, i.e. k = 5). Why? Because complex networks need Big Data to train on. But,
what happens when you train on 80% of your data and test on 20% of it? What
happens is that your estimates of accuracy are not very good estimates, because,
even if your data is Really Big, 20% of it is likely to miss a big chunk of the
variation found in the other 80%, and so you're basically only estimating
your system's accuracy on a small subset of the features that it needs to
represent to be said to "generalise well". For this reason, complex networks,
like all Big Data approaches, are pretty much crap at generalising to "OOD"
data, or, more to the point, estimates of their accuracy on OOD data are just
pretty bad. In practice, deep neural net systems that have amazing performance
"in the lab", can be expected to lose 20-40% of their performance "in the real
world" [2].<p>Bottom line, there's no big mystery why complex networks "generalise" so well
<p>Bottom line, there's no big mystery why complex networks "generalise" so well
and there is no need to seek an explanation for this in kernel machines. Complex
networks generalise so well because the kind of generalisation they're so good
at is the kind of generalisation achieved by humans tweaking the network until
it overfits _to the test data_. And that's the worst kind of overfitting.<p>__________<p>[1] Under PAC-Learning assumptions we can expect the performance of any machine
learning system on data that is truly "out of distribution", in the sense that
it is drawn from a distribution radically different from the distribution of the
training data, to be really bad, because we assume distributional consistency
between training and "true" data. But that's a bit of a quibble, and "OOD" has
established itself as a half-understood jargon term, so I'll let it rest.<p>[2] I had a reference for that somewhere... can't find it. You'll have to take
my word for it. Would I lie to you?