This paper appeared in November, but Allen-Zhu (MSR, <a href="http://people.csail.mit.edu/zeyuan/" rel="nofollow">http://people.csail.mit.edu/zeyuan/</a>) had already posted his result in October. This is their first paper, from October: <a href="https://arxiv.org/pdf/1810.12065.pdf" rel="nofollow">https://arxiv.org/pdf/1810.12065.pdf</a>; this is their second, from November: <a href="https://arxiv.org/pdf/1811.03962.pdf" rel="nofollow">https://arxiv.org/pdf/1811.03962.pdf</a>. In the October paper, they proved a convergence result for training RNNs (which is even harder than for DNNs). In the November paper, they proved the analogous result for DNNs. Compared to the October paper, the November one is actually much easier: in an RNN every layer shares the same weight matrix, whereas in a DNN every layer can have a different weight matrix. Originally they were not planning to write the DNN paper; they only did it because people complained that an RNN is not a multilayer neural network.<p>In summary, the difference between the MSR paper and this one is the following. Let H denote the number of layers and m the number of hidden nodes. The MSR paper shows that assuming only m > poly(H), SGD finds a global optimum. Du et al. have a similar result, but they have to assume m > 2^{O(H)}, i.e. width exponential in depth. Compared to the MSR paper, Du et al.'s paper is actually pretty trivial.
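To get a rough sense of why poly(H) versus 2^{O(H)} matters, here is a small back-of-the-envelope comparison in Python. The exponent 6 below is only a hypothetical placeholder for the actual polynomial degree (the real bound also depends on quantities other than the depth H); the only point is the difference in growth rate of the required width m.

```python
# Rough comparison of the two overparameterization requirements on the width m.
# The exponent 6 is a hypothetical stand-in for "some fixed polynomial degree";
# the actual polynomial in the paper involves more than just the depth H.
for H in (5, 10, 20, 50):
    poly_bound = H ** 6   # placeholder poly(H)-style requirement
    exp_bound = 2 ** H    # 2^{O(H)}-style requirement (with the constant taken as 1)
    print(f"H={H:3d}  poly(H) ~ {poly_bound:.2e}   2^H = {exp_bound:.2e}")
```

For small depths the two are comparable, but already at H = 50 the exponential requirement asks for roughly 10^15 hidden nodes, while any fixed polynomial stays far smaller.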