For new entrants, here's an email I sent out to some colleagues of mine just getting into ML. I'm wrapping up a project that's using Mahout, and am getting into Spark & MLlib now. I've regurgitated this on reddit already.<p>I've been following Apache Spark [0], a new-ish Apache project created by UC Berkeley to replace Hadoop MapReduce [1], for about a month now, and I finally got around to spending some time with it last night and early this morning.<p>Added into the Spark mix about a year ago was a strong machine learning library (MLlib) [2], similar to Mahout [3], that promises much better performance (comparable to or better than MATLAB [4] and Vowpal Wabbit [5]).<p>MLlib is a lower-level library, which offers a lot of control/power for developers. However, Berkeley's AMPLab has also created a higher-level abstraction layer for end users called MLI [6]. It's still being actively developed, and although updates are in the works, they haven't been made available in the public repository for a while [7].<p>Check out an introduction to MLlib on YouTube here: <a href="https://www.youtube.com/watch?v=IxDnF_X4M-8" rel="nofollow">https://www.youtube.com/watch?v=IxDnF_X4M-8</a><p>Getting up to speed with Spark itself is relatively pain-free compared to tools like Mahout.
There's a quick-start guide for Scala [8], a getting-started guide for Spark [9], and lots of other learning/community resources available for Spark [10] [11]<p>[0] <a href="http://spark.apache.org/" rel="nofollow">http://spark.apache.org/</a><p>[1] <a href="http://hadoop.apache.org/" rel="nofollow">http://hadoop.apache.org/</a><p>[2] <a href="http://spark.apache.org/mllib/" rel="nofollow">http://spark.apache.org/mllib/</a><p>[3] <a href="https://mahout.apache.org/" rel="nofollow">https://mahout.apache.org/</a><p>[4] <a href="http://www.mathworks.com/products/matlab/" rel="nofollow">http://www.mathworks.com/products/matlab/</a><p>[5] <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki" rel="nofollow">https://github.com/JohnLangford/vowpal_wabbit/wiki</a><p>[6] <a href="http://www.mlbase.org/" rel="nofollow">http://www.mlbase.org/</a><p>[7] <a href="http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-td3610.html" rel="nofollow">http://apache-spark-user-list.1001560.n3.nabble.com/Status-o...</a><p>[8] <a href="http://www.artima.com/scalazine/articles/steps.html" rel="nofollow">http://www.artima.com/scalazine/articles/steps.html</a><p>[9] <a href="http://spark.apache.org/docs/latest/quick-start.html" rel="nofollow">http://spark.apache.org/docs/latest/quick-start.html</a><p>[10] <a href="http://ampcamp.berkeley.edu/4/exercises/" rel="nofollow">http://ampcamp.berkeley.edu/4/exercises/</a><p>[11] <a href="https://spark.apache.org/community.html" rel="nofollow">https://spark.apache.org/community.html</a>
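For a bit of intuition about what MLlib does under the hood: its early linear methods (logistic regression, linear SVMs) are trained with stochastic gradient descent. Here's a tiny pure-Python sketch of that idea on made-up 1-D data — no Spark involved, and the data, learning rate, and iteration count are all arbitrary choices for illustration; MLlib's contribution is distributing this kind of loop across a cluster.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up, linearly separable 1-D data: label 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# Stochastic gradient descent on logistic loss -- the core computation
# behind MLlib's SGD-based logistic regression, minus the distribution.
w, b = 0.0, 0.0
random.seed(0)
for _ in range(1000):
    x, y = random.choice(data)           # sample one point per step
    pred = sigmoid(w * x + b)
    grad = pred - y                      # d(log-loss)/d(logit)
    w -= 0.5 * grad * x                  # learning rate 0.5, arbitrary
    b -= 0.5 * grad

# On separable data the learned model should fit the training points.
correct = sum((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in data)
print(correct, "of", len(data), "classified correctly")
```

The single-machine version is trivial; the hard part Spark/MLlib solves is running the gradient computation over data too large for one node.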
Note that Spark 1.0.0 makes it possible to trivially submit Spark jobs to an existing Hadoop cluster.<p>It leverages HDFS to distribute archives (e.g. your app JAR) and store results / state / logs, and YARN to schedule itself and acquire compute resources.<p>It's pretty amazing to see how you use Spark's API to write functional applications that are then distributed across multiple executors (e.g. when you use Spark's "filter" or "map" operations, the work potentially gets partitioned and executed on totally different nodes).<p>Great tool, and exciting to see it reach 1.0.0!
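To make that functional style concrete: Spark's transformations have the same shape as ordinary collection operations. Here's a minimal plain-Python sketch of a filter-then-map pipeline — no Spark, no cluster, and the log lines are invented for illustration. In real Spark code the same chain would be written against an RDD, and each stage could run on different executors.

```python
# Toy stand-in for an RDD pipeline: the same filter/map shape that
# Spark would partition across executors, run here on a plain list.
log_lines = [
    "INFO starting job",
    "ERROR disk full",
    "INFO shuffling",
    "ERROR network timeout",
]

# filter: keep only error lines (in Spark: rdd.filter(...))
errors = filter(lambda line: line.startswith("ERROR"), log_lines)

# map: strip the log level, keep the message (in Spark: rdd.map(...))
messages = list(map(lambda line: line.split(" ", 1)[1], errors))

print(messages)  # -> ['disk full', 'network timeout']
```

Because the per-element functions are pure, Spark is free to ship them to whichever nodes hold each partition of the data, which is exactly why the API can stay this simple while the execution is distributed.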
I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is generating a lot of excitement in the big data community:<p><a href="https://speakerdeck.com/stevendborrelli/introduction-to-apache-spark" rel="nofollow">https://speakerdeck.com/stevendborrelli/introduction-to-apac...</a>
I wonder if anyone with experience with Spark can comment / rebut this post: <a href="http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html" rel="nofollow">http://blog.explainmydata.com/2014/05/spark-should-be-better...</a>
Spark is an interesting technology, but from what I've heard it doesn't actually have much traction in industry yet.<p>Is anyone here actually using it in production? I know it's blazing fast, and I like it as a MapReduce replacement. It has all the makings of a great distributed system; I'm just still waiting to see a major deployment.
Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic MapReduce pretty easily; check out our tutorial project to see how: <a href="https://github.com/snowplow/spark-example-project" rel="nofollow">https://github.com/snowplow/spark-example-project</a>