I'm going to check back later to see if anyone manages to reproduce it. Perhaps by the time it's presented at NIPS.<p>A Twitter conversation reflecting some scepticism, but agreeing it would be interesting if it all checks out:
<a href="https://twitter.com/fchollet/status/771862837819867136" rel="nofollow">https://twitter.com/fchollet/status/771862837819867136</a>
Update on this: it has been withdrawn: <a href="https://arxiv.org/abs/1608.04062" rel="nofollow">https://arxiv.org/abs/1608.04062</a>
As far as I understand this, the authors claim they can train convolutional and many other types of deep neural nets faster by pretraining each layer with a new unsupervised technique via which the layer sort of learns to compress its inputs (a local optimization problem), and then they fine-tune the whole network end-to-end with supervised SGD and backpropagation as usual. They have not released code, so no one else has replicated this yet -- as far as I know.<p>If the claim holds, the implication is that layers can <i>quickly</i> learn much of what they need to learn <i>locally</i>, that is, without requiring backpropagation of gradients from potentially very distant layers. I can't help but wonder if this opens the door to more efficient asynchronous/parallel/distributed training of layers, potentially leading to models that update themselves continuously (i.e., "online" instead of in a batch process).<p>I wouldn't be surprised if the claim holds. There is mounting evidence that standard end-to-end backpropagation is a rather inefficient learning mechanism. For example, we now know that deep neural nets can be trained with <i>approximate gradients</i> obtained by shifting bits to get the sign and order of magnitude of the gradient roughly right.[1] In some cases it's even possible to restrict learning to binary weights.[2] More recently, we have learned that it's possible to use "helper" linear models during training <i>to predict what the gradients will be</i> for each layer, in between true-gradient updates, allowing layers to update their parameters locally without waiting for the full backward pass.[3] Finally, don't forget that in the late 2000s, AI researchers were doing a lot of interesting work with unsupervised layer-wise training (e.g., DBNs composed of RBMs, stacked autoencoders).[4] (Rough sketches of the generic layer-wise recipe and of the approximate-gradient trick are appended at the end of this comment.)<p>This is a fascinating area of research with potentially huge payoffs. For example, it would be really neat if we found there's a "general" algorithm via which layers can learn locally from their inputs continuously ("online"), allowing us to combine layers into deep neural nets for specific tasks as needed.<p>[1] <a href="https://arxiv.org/abs/1510.03009" rel="nofollow">https://arxiv.org/abs/1510.03009</a><p>[2] <a href="https://arxiv.org/abs/1602.02830" rel="nofollow">https://arxiv.org/abs/1602.02830</a><p>[3] <a href="https://deepmind.com/blog#decoupled-neural-interfaces-using-synthetic-gradients" rel="nofollow">https://deepmind.com/blog#decoupled-neural-interfaces-using-...</a><p>[4] <a href="https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf" rel="nofollow">https://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf</a><p>EDITS: Expanded the original comment so it better conveys what I actually meant to write, while keeping the language as casual and informal as possible. Also, I softened the tone of my more speculative observations.
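<p>To make the layer-wise idea above more concrete, here's a rough sketch of the <i>generic</i> greedy layer-wise pretraining recipe (stacked autoencoders, in the spirit of [4]) followed by supervised fine-tuning. It is emphatically <i>not</i> the paper's method -- they haven't released code -- and the layer sizes, learning rates, epoch counts, and data variables below are made-up placeholders:

<pre><code>
# Sketch only: generic stacked-autoencoder pretraining + supervised fine-tuning,
# NOT the paper's (unreleased) technique. All hyperparameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

layer_sizes = [784, 256, 64]   # e.g. MNIST-sized input -> two hidden layers
encoders = [nn.Linear(i, o) for i, o in zip(layer_sizes, layer_sizes[1:])]

def pretrain_layer(enc, data, epochs=5, lr=1e-3):
    """Train one layer as an autoencoder on its own inputs (a purely local problem)."""
    dec = nn.Linear(enc.out_features, enc.in_features)   # throwaway decoder
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for x in data:                                    # data: iterable of minibatches
            recon = dec(torch.sigmoid(enc(x)))
            loss = F.mse_loss(recon, x)                   # "learn to compress its inputs"
            opt.zero_grad(); loss.backward(); opt.step()

def encode_through(encoders_done, data):
    """Push minibatches through already-pretrained layers to feed the next one."""
    out = []
    with torch.no_grad():
        for x in data:
            for enc in encoders_done:
                x = torch.sigmoid(enc(x))
            out.append(x)
    return out

# 1) Unsupervised, layer by layer: each layer only ever sees its own inputs.
#    (unlabeled_batches is assumed to be a list of float tensors of shape [B, 784].)
def pretrain_stack(encoders, unlabeled_batches):
    for i, enc in enumerate(encoders):
        local_data = encode_through(encoders[:i], unlabeled_batches)
        pretrain_layer(enc, local_data)

# 2) Supervised fine-tuning of the whole stack end-to-end, as usual.
def fine_tune(encoders, labeled_batches, n_classes=10, epochs=5, lr=1e-3):
    head = nn.Linear(encoders[-1].out_features, n_classes)
    model = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders], head)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in labeled_batches:
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return model
</code></pre>

The point of the sketch is the division of labor: step 1 is entirely local (no gradients flow between layers), and only step 2 uses end-to-end backpropagation.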
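<p>And here's an equally rough illustration of the approximate-gradient idea I alluded to with [1]: keep only the sign and the nearest power of two of each gradient entry, so weight updates could in principle be done with bit shifts instead of full multiplications. Again, this is the gist, not the paper's exact scheme:

<pre><code>
# Sketch only: quantize each gradient entry to sign * nearest power of two.
import torch

def power_of_two_quantize(grad, eps=1e-8):
    sign = torch.sign(grad)
    exponent = torch.round(torch.log2(grad.abs() + eps))   # nearest power of two
    return sign * torch.pow(2.0, exponent)

g = torch.tensor([0.013, -0.19, 0.0007])
print(power_of_two_quantize(g))   # roughly [0.0156, -0.25, 0.00098]
</code></pre>

The surprising empirical finding is that updates this coarse are often enough for training to work, which is part of why I suspect exact end-to-end gradients aren't as essential as we tend to assume.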