It's random forests ... each tree is trained on a <i>subset</i> of the data. You can split the massive dataset into chunks and train independently. That sidesteps the "big data" hangup.

If you look at the scikit-learn implementation, each tree emits a normalised probability vector for each prediction, and those vectors are simply averaged to get the aggregate prediction, so it's not very difficult to do yourself.

Regardless, though, you are still applying a batch learning technique. For big data you really want an incremental learner.
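To make that concrete, here is a minimal sketch (my own illustration, not the article's setup or scikit-learn's internals): train an independent forest on each chunk, then average the per-class probability vectors across forests and take the argmax.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Toy stand-in for a dataset too large to train on in one go.
    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
    chunk_indices = np.array_split(np.arange(len(X)), 4)  # pretend: 4 independent shards

    # Train one forest per chunk, completely independently (could run on separate machines).
    forests = [
        RandomForestClassifier(n_estimators=50, random_state=i).fit(X[idx], y[idx])
        for i, idx in enumerate(chunk_indices)
    ]

    def aggregate_predict(forests, X_new):
        # Average the normalised probability vectors from each forest, then take the
        # argmax -- the same soft-voting idea a single forest applies across its own trees.
        probs = np.mean([f.predict_proba(X_new) for f in forests], axis=0)
        return forests[0].classes_[np.argmax(probs, axis=1)]

    print(aggregate_predict(forests, X[:5]))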
Any chance you could run your benchmarks on this branch of Scikit-Learn? <a href="https://github.com/glouppe/scikit-learn/tree/trees-v2" rel="nofollow">https://github.com/glouppe/scikit-learn/tree/trees-v2</a> It should ship soon :)

We have been working hard to reduce computing times and memory footprint (though there is still a lot of room for improvement on that side).

(Unfortunately, I cannot run your benchmarks myself: the compiled version of WiseRF requires a newer version of glibc than the one on my cluster, and crashes.)
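For anyone else who wants to try the branch, one way (assuming NumPy, SciPy, Cython and a C compiler are already set up) is to install it directly from git:

    pip install git+https://github.com/glouppe/scikit-learn.git@trees-v2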
Question: why do I have to implement hyperparameter selection myself?

For me, the promise of in-the-cloud machine learning is that I can call a 'train' method and specify one single hyperparameter: the training budget (i.e. $). Perhaps also the maximum time before I am returned a trained model.

That's it. Can you do that?
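A rough sketch of what I mean, purely hypothetical (the `train` helper, its `max_seconds` parameter, and the naive random search are made up for illustration, not any provider's API):

    import time
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def train(X, y, max_seconds=60):
        # Hypothetical budget-driven trainer: the caller supplies only data and a
        # wall-clock budget; hyperparameter selection (here a naive random search)
        # happens internally. A paid service would map $ to compute time similarly.
        rng = np.random.default_rng(0)
        deadline = time.time() + max_seconds
        best_score, best_model = -np.inf, None
        while time.time() < deadline:
            params = {
                "n_estimators": int(rng.integers(50, 300)),
                "max_depth": int(rng.integers(3, 20)),
            }
            candidate = RandomForestClassifier(**params, random_state=0)
            score = cross_val_score(candidate, X, y, cv=3).mean()
            if score > best_score:
                best_score, best_model = score, candidate.fit(X, y)
        return best_model

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    model = train(X, y, max_seconds=30)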