"In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train."<p>Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with a default learning rate of 0.001 and do pretty well. Dropout is not even necessary with many models that use BatchNorm nowadays, so generally tuning there is not an issue either. Many layers of 3x3 conv with stride 1 is still magical.<p>Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.
I've always found it curious that Neural Networks get so much hype when xgboost (gradient boosted decision trees) is by far the most popular and accurate algorithm for most Kaggle competitions. While neural networks are better for image processing types of problems, there are a wide variety of machine learning problems where decision tree methods perform better and are much easier to implement.
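And "easier to implement" is not an exaggeration. The whole xgboost workflow on a tabular problem is a few lines with its scikit-learn-style API (illustrative sketch; the dataset and settings here are arbitrary):

    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Gradient boosted trees on a small tabular dataset -- no GPU needed.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = XGBClassifier(n_estimators=200, max_depth=4)  # near-default settings
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    print("feature importances:", model.feature_importances_)

The trained model exposing feature_importances_ is also a big part of why these methods are easy to justify in practice.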
I don't know about the others, but the two vision datasets they compare on (MNIST and the face recognition one) are small, and the CNN they compare against doesn't seem very state of the art.

It also seems each layer of random forests just concatenates a class distribution onto the original feature vector. So this doesn't seem to get the same "hierarchy of features" benefit that you get in large-scale CNNs and DNNs.
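If I'm reading the paper right, each cascade level amounts to something like this (sketch with scikit-learn forests; the function name and details are mine, and the paper generates the class vectors with k-fold cross-validation rather than fitting and predicting on the same data as done here):

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

    def cascade_level(X_in, y, X_orig):
        # One gcForest-style cascade step (my reading of the paper): fit
        # a few forests, get per-class probability vectors, and append
        # them to the ORIGINAL features to form the next level's input.
        # (Fitting and predicting on the same data, as here, overfits;
        # the paper uses k-fold CV to produce the class vectors.)
        forests = [RandomForestClassifier(n_estimators=100),
                   ExtraTreesClassifier(n_estimators=100)]
        class_vectors = [f.fit(X_in, y).predict_proba(X_in) for f in forests]
        return np.hstack([X_orig] + class_vectors)

    # Toy usage: each level only widens the input by n_forests * n_classes.
    X = np.random.rand(100, 20)
    y = np.random.randint(0, 3, 100)
    level1 = cascade_level(X, y, X)        # 20 + 2*3 = 26 columns
    level2 = cascade_level(level1, y, X)   # 26 columns again

So the "depth" only ever adds a handful of class-probability columns per level, which is why it's hard to see a learned feature hierarchy in there.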
Was about to joke about Deep Support Vector Machines, but found out they exist too:

https://www.esat.kuleuven.be/sista/ROKS2013/files/presentations/DSVM_ROKS_2013_WIERING.pdf

http://deeplearning.net/wp-content/uploads/2013/03/dlsvm.pdf
No Free Lunch theorem refresher:

"if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems"

https://en.m.wikipedia.org/wiki/No_free_lunch_theorem
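Formally (my paraphrase of Wolpert and Macready's statement, from memory; check the linked article for the exact form): for any two algorithms a_1 and a_2, summed over all objective functions f, the probability of observing a given sequence of cost values d^y_m after m evaluations is identical:

    % Wolpert & Macready's NFL formulation (from memory -- see link above):
    \sum_f P(d^y_m \mid f, m, a_1) = \sum_f P(d^y_m \mid f, m, a_2)

So any above-average performance on one class of problems is exactly balanced by below-average performance elsewhere.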
While cost optimizations from dropping GPUs as a requirement are important (and presumably these systems would benefit from GPU optimization as well; it seems unclear from my skim of the paper), cheaper training is NOT just about saving your wallet.

A real emerging area of opportunity is having systems train new systems. This has numerous applications, including assisting DSEs in the construction of new systems, or allowing expert systems to learn more over time and even integrate new techniques into a currently deployed system.

I'm not an expert here, but I'd like to be, so I'm definitely going to ask my expert friends more about this.
Please note the CPUs they used are pretty advanced: 2x Intel E5-2670 v3 (24 cores total), approx. $1.5k per unit (http://ark.intel.com/products/81709/Intel-Xeon-Processor-E5-2670-v3-30M-Cache-2_30-GHz).

Looking forward to trying the code (especially on CIFAR or ImageNet); Zhi-Hua Zhou, one of the authors, said they are going to publish it soon.
None of these experiments actually do anything to show feature learning - if that is the claim, I would like to see a transfer learning experiment. I would be surprised if this works well, since they can't jointly optimize their layers (so you can't just use ImageNet to induce a good representation). It's also not clear why we should expect trees to turn out inherently cheaper than a DNN of similar accuracy, unless perhaps the model structure encodes a prior which matches the distribution of the problem?
The method's performance on MNIST is relatively mediocre. You might think 98.96% is amazing, but what matters is relative performance: it is a relatively easy exercise nowadays to get above 99% with neural nets. Even I can get that kind of performance with hand-written Python neural nets, on the CPU, with no convolutions.

For the rest of the (non-image) datasets, it's already common knowledge that boosting methods are competitive with neural nets.
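For reference, the core of such a hand-written net is small: one ReLU hidden layer, softmax cross-entropy, plain SGD, all numpy. This is an illustrative sketch on random data, not tuned; swap in the 784-dim MNIST vectors and train for a few epochs for the real thing:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def train_step(X, y_onehot, W1, b1, W2, b2, lr=0.1):
        """One SGD step for a one-hidden-layer ReLU net with softmax output."""
        # Forward pass
        h = np.maximum(0, X @ W1 + b1)          # ReLU hidden layer
        p = softmax(h @ W2 + b2)                # class probabilities
        # Backward pass (gradient of mean cross-entropy w.r.t. logits)
        n = X.shape[0]
        dz2 = (p - y_onehot) / n
        dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
        dh = dz2 @ W2.T
        dh[h <= 0] = 0                          # ReLU gradient mask
        dW1, db1 = X.T @ dh, dh.sum(axis=0)
        # Plain SGD update, in place
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
        return -np.log(p[np.arange(n), y_onehot.argmax(axis=1)]).mean()

    # Toy usage on random data shaped like MNIST (64 samples, 784 features).
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.01, (784, 800)), np.zeros(800)
    W2, b2 = rng.normal(0, 0.01, (800, 10)), np.zeros(10)
    X = rng.normal(size=(64, 784))
    y = np.eye(10)[rng.integers(0, 10, 64)]
    print("loss:", train_step(X, y, W1, b1, W2, b2))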
Would like to see three things coming out of this:

1. R code implementation (could probably write this myself, but it would make things easier)

2. How to get feature importance? Otherwise it's difficult to use in a business context.

3. Better benchmarks
Progress in this field is astonishing and it really propagates to the masses in the form of easy-to-use black boxes with a pinch of undergraduate-level maths. Just wow!