A note on dropout:

If your layer size is relatively small (not hundreds or thousands of nodes), dropout is usually detrimental, and a more traditional regularization method such as weight decay is superior.

For networks of the size Hinton et al. are playing with nowadays (thousands of nodes per layer), dropout is good, though.
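To make the contrast concrete, here is a minimal numpy sketch (all sizes and constants are made up for illustration, not taken from the slides) of the two options: dropout applied to a hidden layer's activations versus a weight-decay (L2) term folded into the gradient update.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical mid-training state: one hidden layer's weights W and a
    # batch of its activations h.
    W = rng.normal(scale=0.1, size=(100, 50))   # a "small" layer: tens of units
    h = rng.normal(size=(32, 50))               # batch of hidden activations

    # Option A: inverted dropout, applied only at training time. With a small
    # layer, zeroing half the units removes a big chunk of its capacity,
    # which is why it can hurt here.
    keep_prob = 0.5
    mask = rng.random(h.shape) < keep_prob
    h_train = (h * mask) / keep_prob            # rescale so the expected value is unchanged

    # Option B: weight decay (L2), added to the gradient in the SGD update.
    grad_W = rng.normal(size=W.shape)           # stand-in for the backprop gradient
    weight_decay = 1e-4
    learning_rate = 0.01
    W -= learning_rate * (grad_W + weight_decay * W)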
Who is Arno Candel, and why should we pay attention to his tips on training neural networks? Anyone who suggests grid search for hyperparameter tuning is out of touch with the consensus among experts in deep learning. A lot of people are coming out of the woodwork and presenting themselves as experts in this exciting area because it has had so much success recently, but most of them seem to be beginners. Having lots of beginners learning the field is fine and healthy, but many of them act as if they are experts.
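For what it's worth, the consensus being alluded to is presumably that random search tends to beat grid search for this kind of tuning (Bergstra & Bengio, 2012). A toy Python sketch of the difference, with a made-up objective (the evaluate stub is hypothetical, standing in for "train a net and return validation accuracy") and the same budget of 16 trials for each strategy:

    import itertools
    import math
    import random

    random.seed(0)

    def evaluate(lr, hidden):
        # Hypothetical stand-in for training a net and returning validation
        # accuracy; a made-up objective peaked near lr=1e-3, hidden=200.
        return math.exp(-(math.log10(lr) + 3) ** 2) * math.exp(-((hidden - 200) / 300) ** 2)

    # Grid search: a few fixed values per hyperparameter, 4 x 4 = 16 trials.
    grid = itertools.product([1e-1, 1e-2, 1e-3, 1e-4], [50, 100, 200, 400])
    best_grid = max(grid, key=lambda p: evaluate(*p))

    # Random search: the same 16-trial budget, sampled log-uniformly in the
    # learning rate and uniformly in the hidden size, so each individual
    # axis gets covered much more densely.
    samples = [(10 ** random.uniform(-4, -1), random.randint(50, 400)) for _ in range(16)]
    best_random = max(samples, key=lambda p: evaluate(*p))

    print("grid  :", best_grid)
    print("random:", best_random)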
I would just like to link to my earlier comments for people who may be curious:

https://news.ycombinator.com/item?id=7803101

I will also add that switching to Hessian-free optimization for training feed-forward nets, instead of conjugate gradient/L-BFGS/SGD, has proven to be amazing [1].

Recursive nets I'm still playing with, but in the work by Socher [2], they used L-BFGS just fine.

[1]: http://www.cs.toronto.edu/~rkiros/papers/shf13.pdf

[2]: http://socher.org/
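The paper in [1] is the place to look for the actual Hessian-free method; purely as a toy illustration of the quasi-Newton family mentioned above, here is a minimal scipy sketch that fits a tiny feed-forward net with L-BFGS (everything here, including the XOR data, the sizes, and the finite-difference gradients, is invented for the example and not taken from [1] or [2]):

    import numpy as np
    from scipy.optimize import minimize

    # Toy XOR problem and a one-hidden-layer net.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0], dtype=float)
    n_in, n_hid = 2, 4

    def unpack(theta):
        # Split the flat parameter vector into the net's weights and biases.
        i = 0
        W1 = theta[i:i + n_in * n_hid].reshape(n_in, n_hid); i += n_in * n_hid
        b1 = theta[i:i + n_hid]; i += n_hid
        w2 = theta[i:i + n_hid]; i += n_hid
        b2 = theta[i]
        return W1, b1, w2, b2

    def loss(theta):
        W1, b1, w2, b2 = unpack(theta)
        h = np.tanh(X @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid output
        eps = 1e-9
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    rng = np.random.default_rng(0)
    theta0 = rng.normal(scale=0.5, size=n_in * n_hid + 2 * n_hid + 1)
    # No analytic gradient is supplied, so scipy falls back to finite differences.
    res = minimize(loss, theta0, method="L-BFGS-B")
    print("final cross-entropy:", res.fun)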
A question about the actual slides: why don't they use unsupervised pretraining (e.g., a sparse autoencoder) for MNIST? Is it just to show that they don't need pretraining to achieve good results, or is there something deeper?
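Not an answer, but for anyone unfamiliar with what that pretraining step would look like, here is a rough Keras sketch (Keras is my assumption, not what H2O uses, and the layer sizes are arbitrary) of fitting a sparse autoencoder on MNIST pixels and then reusing the encoder to initialize a supervised classifier:

    from tensorflow import keras

    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

    # Unsupervised phase: reconstruct the inputs through a sparse bottleneck.
    inputs = keras.Input(shape=(784,))
    code = keras.layers.Dense(128, activation="relu",
                              activity_regularizer=keras.regularizers.l1(1e-5))(inputs)
    decoded = keras.layers.Dense(784, activation="sigmoid")(code)
    autoencoder = keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x_train[:10000], x_train[:10000], epochs=1, batch_size=128)

    # Supervised phase: the classifier shares the pretrained encoder layer
    # and fine-tunes it along with a new softmax output.
    outputs = keras.layers.Dense(10, activation="softmax")(code)
    classifier = keras.Model(inputs, outputs)
    classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                       metrics=["accuracy"])
    classifier.fit(x_train[:10000], y_train[:10000], epochs=1, batch_size=128)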
Direct link to slides: http://www.slideshare.net/0xdata/h2o-distributed-deep-learning-by-arno-candel-071614