Support Vector Machines and Hadoop: Theory vs. Practice

34 points by techtime77 over 11 years ago

6 comments

HamSession over 11 years ago
This is an interesting take, and it shows why I love machine learning and its intersection with HPC.

First, the seemingly blind decision to implement an SVM for improved performance. An SVM isn't magical; in fact, an SVM and a neural network are equivalent, with the SVM being the general case. SVMs suffer from the same problems as neural networks: choosing your kernel function is the same kind of problem as choosing your number of hidden nodes and activation function. When looking at your problem you have to ask yourself:

1) Is the model time varying? -> NN

2) Very large N-dimensional search space? -> SVM

Secondly, even without changing algorithms you can get a significant accuracy improvement by examining your features. Features are the most important part of machine learning (garbage in, garbage out). Even simple classifiers such as Naive Bayes can do well if given the right feature set. There are multiple methods to examine your features, such as ReliefF; another is ANOVA. If you find your features are not good enough, try unsupervised feature detection, learn more about the problem domain, and come up with your own features.

The final issue, specific to HPC and machine learning, is that even given 100 cores your algorithms may not speed up. Many machine learning algorithms are iterative in nature and do not lend themselves to becoming parallel. This means that MapReduce must be invoked at each iteration. As you scale up the number of available cores, the overhead of starting up and shutting down your cluster at each iteration overrides your gain in performance, as many nodes finish faster than others and just sit and wait.

The solution to all of this is simple:

1) Get your features correct.

2) Try new algorithms.

   - Try an online learning algorithm first, like Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/

3) HPC.

   - Apache Spark (http://spark.incubator.apache.org/) or GraphLab (http://graphlab.org/), or (if personal computer only) GraphChi (http://graphlab.org/graphchi/)
   - Both support HPC with a graph-centric framework
   - Orders of magnitude faster than Hadoop
   - Both are built on top of Hadoop HDFS, so connect and go

Hope this helps everyone out there.

Have fun and try to solve some cool problems.
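As a concrete illustration of the "examine your features" advice above, here is a minimal sketch of an ANOVA-style feature check using scikit-learn. The synthetic dataset and variable names are placeholders for illustration only, not from the original pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data standing in for whatever features the real pipeline produces.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# f_classif runs a one-way ANOVA F-test per feature: high scores suggest the
# feature separates the classes, low scores suggest it is mostly noise.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
for idx in np.argsort(selector.scores_)[::-1][:5]:
    print(f"feature {idx}: F = {selector.scores_[idx]:.1f}")

# Keep only the k highest-scoring features before retraining the classifier.
X_reduced = selector.transform(X)
print("reduced shape:", X_reduced.shape)
```

The same idea applies regardless of which classifier (SVM, neural network, Naive Bayes) sits downstream: rank or prune the features first, then compare models.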
pnachbaur over 11 years ago
I'm still surprised there is no SVM implementation for Mahout (http://stackoverflow.com/questions/10482646/recently-svm-implementation-was-added-into-mahout-i-am-planning-to-use-svm-an).
chcleaves over 11 years ago
A few years ago some SVM code was contributed to the Mahout project, but as of yet it still doesn't appear to have a working implementation. It seems one can tweak existing Mahout functions a bit to accomplish the same sort of thing, but Mike went ahead and started working on an SVM implementation when he initially discovered it wasn't fully implemented. Given a package like sklearn (part of the Python scipy ecosystem), it's not so hard to implement a scheme similar to the one described in the blog once you know what to do.
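For readers unfamiliar with sklearn, a rough sketch of what training a linear SVM looks like there follows. The data, parameters, and train/test split are assumptions for illustration; this is not the blog's actual scheme or its Hadoop integration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Placeholder data; in practice X and y would come from the feature pipeline.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# LinearSVC fixes a linear kernel; kernel choice is the SVM analogue of
# picking a network architecture, as the earlier comment points out.
clf = LinearSVC(C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```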
PaulHoule over 11 years ago
I'd like to see some classification performance numbers, ROC curves, etc.
eakyol over 11 years ago
Paul,

We don't have any published performance numbers at the moment, as we just implemented our cluster within our production environment. Looking to do a post-facto write-up on that in a bit.
konstantintin over 11 years ago
How much of an increase in accuracy is gained by using the full collection of data rather than sampling?