科技回声

10 条评论

fwilliams超过 6 年前

It's worth noting that the primary result of this paper has only to do with the error on the training data under empirical risk minimization. Zero training error =/= a model that generalizes. For any optimization problem, you can always add enough parameters to achieve zero error on a problem over a finite training set (imagine introducing enough variables to fully memorize the map from inputs to labels).The major contribution of the work is showing that ResNet needs a number of parameters which is polynomial in the dataset size to converge to a global optimum in contrast to traditional neural nets which require an exponential number of parameters.

评论 #18438505 未加载

评论 #18438576 未加载

iandanforth超过 6 年前

My hope is that as these bounds are refined we can start to do BotE calculations such as, "I have 50k training images of 512x512x3 and 1k classes, this means I'll need a Convolutional Resnet of at most 12 layers and 12M params to fit the training data so let's start at half that." Rather than today which is 'let's use resnet 101 and see if that works.'

评论 #18437326 未加载

ramgorur超过 6 年前

I did not understand the paper very well.1. It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.2. The only way to guarantee is to start the gradient descent from different points in the search space or try with different step sizes if the algorithm only starts from the same point in the search space.3. Also does "achieving zero training loss" mean the network has converged to the global optima? I used to know you will get zero training loss even if you are at a local minima as well.Please correct me if I am wrong.

评论 #18436995 未加载

评论 #18436934 未加载

评论 #18436208 未加载

评论 #18436297 未加载

charleshmartin超过 6 年前

An excellent paper which uses (some of the) results we have also found studying the weight matrices of neural networks...namely that they rarely undergo rank collapse<a href="https://calculatedcontent.com/2018/09/21/rank-collapse-in-deep-learning/" rel="nofollow">https://calculatedcontent.com/2018/09/21/rank-collapse-in-de...</a>But they miss something..the weight matrices also display power law behavior.<a href="https://calculatedcontent.com/2018/09/09/power-laws-in-deep-learning/" rel="nofollow">https://calculatedcontent.com/2018/09/09/power-laws-in-deep-...</a>This is also important because it was suggested in the early 90s that Heavy Tailed Spin Glasses would have a single local mimima.This fact is the basis of my early suggestion that DNNs would exhibit a spin funnel

gogut超过 6 年前

This paper appears in November, but in fact, Allen-Zhu (MSR, <a href="http://people.csail.mit.edu/zeyuan/" rel="nofollow">http://people.csail.mit.edu/zeyuan/</a> ) already posted his result in Oct. This is their first paper in Oct:<a href="https://arxiv.org/pdf/1810.12065.pdf" rel="nofollow">https://arxiv.org/pdf/1810.12065.pdf</a>, this is their second paper in Nov <a href="https://arxiv.org/pdf/1811.03962.pdf" rel="nofollow">https://arxiv.org/pdf/1811.03962.pdf</a> . In MSR Oct paper, they proved how to train RNN (which is even harder than DNN). In their Nov paper, they proved how to train DNN. Compared to their Oct one, the Nov one is actually much easier. The reason is, in RNN, every layer has the same weight matrix, but in DNN every layer could have different weight matrices. Originally, they were not planning to write this DNN paper. Since someone is complaining that RNN is not multilayer neural network, that’s why they did it.In summary, the difference between MSR paper and this paper is: if H denotes the number of layers, let m denote the number of hidden nodes. MRS paper can show we only need to assume m > poly (H), using SGD, the model can find the global optimal. However, in Du et al.’s work, they have a similar result, but they have to assume m > 2^{O(H)}. Compared to MSR paper, Du et al.’s paper is actually pretty trivial.

brentjanderson超过 6 年前

Although I'm no expert, isn't this result an incredibly important contribution? This paper claims to prove that:> The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet).If this variant of gradient descent is able to reach global minima in polynomial time, and if neural networks are proven to approximate any function, then ostensibly this technique could be used to guarantee the lowest error possible in approximating any function. This seems incredibly important. Can someone correct my reading of the abstract?

评论 #18435616 未加载

评论 #18435330 未加载

评论 #18435318 未加载

评论 #18435617 未加载

juskrey超过 6 年前

How do you gradually detect a Dirac stick?

pj_mukh超过 6 年前

With this much wide distribution of ML algorithms in use, its still funny to see papers begin with "One of the mysteries of deep learning is that...." and then goes on to lay out the multiple ways we have no idea why some of these DL techniques work.

评论 #18438649 未加载

orf超过 6 年前

> One of the mysteries in deep learning is random initialized first order methods like gradient descent achieve zero training loss, even if the labels are arbitraryCan someone expand on this? I've never heard of this before, at least not in the general case.

评论 #18437042 未加载

lbj超过 6 年前

I cant claim to fully understand the proof, but these guys have done an amazing job in terms of furthering our understanding of deep nets.

10 条评论

fwilliams超过 6 年前

评论 #18438505 未加载

评论 #18438576 未加载

iandanforth超过 6 年前

评论 #18437326 未加载

ramgorur超过 6 年前

评论 #18436995 未加载

评论 #18436934 未加载

评论 #18436208 未加载

评论 #18436297 未加载

charleshmartin超过 6 年前

gogut超过 6 年前

brentjanderson超过 6 年前

评论 #18435616 未加载

评论 #18435330 未加载

评论 #18435318 未加载

评论 #18435617 未加载

juskrey超过 6 年前

How do you gradually detect a Dirac stick?

pj_mukh超过 6 年前

评论 #18438649 未加载

orf超过 6 年前

评论 #18437042 未加载

lbj超过 6 年前

I cant claim to fully understand the proof, but these guys have done an amazing job in terms of furthering our understanding of deep nets.

Gradient Descent Finds Global Minima of Deep Neural Networks

10 条评论

Gradient Descent Finds Global Minima of Deep Neural Networks

10 条评论