Support Vector Machines and Hadoop: Theory vs. Practice

34 points by techtime77 over 11 years ago

6 comments

HamSession over 11 years ago
This is an interesting take, and it shows why I love machine learning and its intersection with HPC.

First, the seemingly blind decision to implement an SVM for improved performance. An SVM isn't magical; in fact, an SVM and a neural network are equivalent, with the SVM being the general case. SVMs suffer from the same problems as neural networks: choosing your kernel function is the same kind of problem as choosing your number of hidden nodes and activation function. When looking at your problem you have to ask yourself:

1) Is the model time varying? -> NN

2) Very large N-dimensional search space? -> SVM

Secondly, even without changing algorithms you can get a significant accuracy improvement by examining your features. Features are the most important part of machine learning (garbage in, garbage out). Even simple classifiers such as Naive Bayes can do well if given the right feature set. There are multiple methods to examine your features, such as ReliefF; another is ANOVA. If you find your features are not good enough, try unsupervised feature detection, learn more about the problem domain, and come up with your own features.

The final issue, specific to HPC and machine learning, is that even given 100 cores your algorithms may not speed up. Many machine learning algorithms are iterative in nature and do not lend themselves to becoming parallel. This means that MapReduce must be invoked at each iteration. As you scale up the number of available cores, the overhead of starting up and shutting down your cluster at each iteration overrides your gain in performance, as many nodes finish faster than others and just sit and wait.

The solution to all of this is simple:

1) Get your features correct.

2) Try new algorithms.

   - Try an online learning algorithm first, like Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/

3) HPC.

   - Apache Spark (http://spark.incubator.apache.org/) or GraphLab (http://graphlab.org/), or (if personal computer only) GraphChi (http://graphlab.org/graphchi/)
   - Both support HPC with a graph-centric framework
   - Orders of magnitude faster than Hadoop
   - Both are built on top of Hadoop HDFS, so connect and go

Hope this helps everyone out there.

Have fun and try to solve some cool problems.
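As a concrete illustration of the "examine your features" advice above, here is a minimal sketch of an ANOVA-style feature check using scikit-learn. The synthetic dataset and variable names are placeholders for illustration only, not from the original pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data standing in for whatever features the real pipeline produces.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# f_classif runs a one-way ANOVA F-test per feature: high scores suggest the
# feature separates the classes, low scores suggest it is mostly noise.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
for idx in np.argsort(selector.scores_)[::-1][:5]:
    print(f"feature {idx}: F = {selector.scores_[idx]:.1f}")

# Keep only the k highest-scoring features before retraining the classifier.
X_reduced = selector.transform(X)
print("reduced shape:", X_reduced.shape)
```

The same idea applies regardless of which classifier (SVM, neural network, Naive Bayes) sits downstream: rank or prune the features first, then compare models.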
pnachbaur over 11 years ago
I'm still surprised there is no SVM implementation for Mahout (http://stackoverflow.com/questions/10482646/recently-svm-implementation-was-added-into-mahout-i-am-planning-to-use-svm-an).
chcleaves over 11 years ago
A few years ago some SVM code was contributed to the Mahout project, but as of yet it still doesn't appear to have a working implementation. It seems one can tweak existing Mahout functions a bit to accomplish the same sort of thing, but Mike went ahead and started working on an SVM implementation when he initially discovered it wasn't fully implemented. Given a package like sklearn (part of the Python scipy ecosystem), it's not so hard to implement a scheme similar to the one described in the blog once you know what to do.
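For readers unfamiliar with sklearn, a rough sketch of what training a linear SVM looks like there follows. The data, parameters, and train/test split are assumptions for illustration; this is not the blog's actual scheme or its Hadoop integration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data; in practice X and y would come from the feature pipeline.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# LinearSVC fixes a linear kernel; kernel choice is the SVM analogue of
# picking a network architecture, as the earlier comment points out.
clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```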
PaulHoule over 11 years ago
I'd like to see some classification performance numbers, ROC curves, etc.
eakyol over 11 years ago
Paul,

We don't have any published performance numbers at the moment, as we just implemented our cluster within our production environment. Looking to do a post-facto write-up on that in a bit.
konstantintin over 11 years ago
How much of an increase in accuracy is gained by using the full collection of data rather than sampling?