It's random forests ... each tree is trained on a <i>subset</i> of the data. You can split the massive dataset into chunks and train independently. That sidesteps the "big data" hangup.

If you look at the scikit-learn implementation, each tree emits a normalised probability vector for each prediction, and those vectors are simply averaged to get the aggregate prediction, so it's not very difficult to do yourself.

Regardless, though, you are still applying a batch learning technique. For big data you really want an incremental learner.
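To make that concrete, here is a minimal sketch (my own illustration, not the article's setup or scikit-learn's internals): train an independent forest on each chunk, then average the per-class probability vectors across forests and take the argmax.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-in for a dataset too large to train on in one go.
    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
    chunk_indices = np.array_split(np.arange(len(X)), 4)  # pretend: 4 independent shards

    # Train one forest per chunk, completely independently (could run on separate machines).
    forests = [
        RandomForestClassifier(n_estimators=50, random_state=i).fit(X[idx], y[idx])
        for i, idx in enumerate(chunk_indices)
    ]

    def aggregate_predict(forests, X_new):
        # Average the normalised probability vectors from each forest, then take the
        # argmax -- the same soft-voting idea a single forest applies across its own trees.
        probs = np.mean([f.predict_proba(X_new) for f in forests], axis=0)
        return forests[0].classes_[np.argmax(probs, axis=1)]

    print(aggregate_predict(forests, X[:5]))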
Any chance you could run your benchmarks on this branch of Scikit-Learn? <a href="https://github.com/glouppe/scikit-learn/tree/trees-v2" rel="nofollow">https://github.com/glouppe/scikit-learn/tree/trees-v2</a> It should ship soon :)

We have been working hard to reduce computing times and memory footprint (though there is still a lot of room for improvement on that side).

(Unfortunately, I cannot run your benchmarks myself: the compiled version of WiseRF requires a newer version of glibc than the one on my cluster, and crashes.)
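For anyone else who wants to try the branch, one way (assuming NumPy, SciPy, Cython and a C compiler are already set up) is to install it directly from git:

    pip install git+https://github.com/glouppe/scikit-learn.git@trees-v2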
Question: why do I have to implement hyperparameter selection myself?

For me, the promise of in-the-cloud machine learning is that I can call a 'train' method and specify one single hyperparameter: the training budget (i.e. $). Perhaps also the maximum time before I am returned a trained model.

That's it. Can you do that?
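A rough sketch of what I mean, purely hypothetical (the `train` helper, its `max_seconds` parameter, and the naive random search are made up for illustration, not any provider's API):

    import time
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def train(X, y, max_seconds=60):
        # Hypothetical budget-driven trainer: the caller supplies only data and a
        # wall-clock budget; hyperparameter selection (here a naive random search)
        # happens internally. A paid service would map $ to compute time similarly.
        rng = np.random.default_rng(0)
        deadline = time.time() + max_seconds
        best_score, best_model = -np.inf, None
        while time.time() < deadline:
            params = {
                "n_estimators": int(rng.integers(50, 300)),
                "max_depth": int(rng.integers(3, 20)),
            }
            candidate = RandomForestClassifier(**params, random_state=0)
            score = cross_val_score(candidate, X, y, cv=3).mean()
            if score > best_score:
                best_score, best_model = score, candidate.fit(X, y)
        return best_model

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    model = train(X, y, max_seconds=30)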