For new entrants, here's an email I sent out to some colleagues of mine just getting into ML. I'm wrapping up a project that's using Mahout, and am getting into Spark & MLlib now. I've regurgitated this on reddit already.<p>I've been following Apache Spark [0], a new-ish Apache project created by UC Berkeley to replace Hadoop MapReduce [1], for about a month now, and I finally got around to spending some time with it last night and early this morning.<p>Added into the Spark mix about a year ago was a strong machine learning library (MLlib) [2], similar to Mahout [3], that promises much better performance (comparable to or better than MATLAB [4] and Vowpal Wabbit [5]).<p>MLlib is a lower-level library, which offers a lot of control/power for developers. However, Berkeley's AMPLab has also created a higher-level abstraction layer for end users called MLI [6]. It's still being actively developed, and although updates are in the works, they haven't been made available in the public repository for a while [7].<p>Check out an introduction to MLlib on YouTube here: <a href="https://www.youtube.com/watch?v=IxDnF_X4M-8" rel="nofollow">https://www.youtube.com/watch?v=IxDnF_X4M-8</a><p>Getting up to speed with Spark itself is relatively pain-free compared to tools like Mahout.
There's a quick-start guide for Scala [8], a getting-started guide for Spark [9], and lots of other learning/community resources available for Spark [10] [11]<p>[0] <a href="http://spark.apache.org/" rel="nofollow">http://spark.apache.org/</a><p>[1] <a href="http://hadoop.apache.org/" rel="nofollow">http://hadoop.apache.org/</a><p>[2] <a href="http://spark.apache.org/mllib/" rel="nofollow">http://spark.apache.org/mllib/</a><p>[3] <a href="https://mahout.apache.org/" rel="nofollow">https://mahout.apache.org/</a><p>[4] <a href="http://www.mathworks.com/products/matlab/" rel="nofollow">http://www.mathworks.com/products/matlab/</a><p>[5] <a href="https://github.com/JohnLangford/vowpal_wabbit/wiki" rel="nofollow">https://github.com/JohnLangford/vowpal_wabbit/wiki</a><p>[6] <a href="http://www.mlbase.org/" rel="nofollow">http://www.mlbase.org/</a><p>[7] <a href="http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-td3610.html" rel="nofollow">http://apache-spark-user-list.1001560.n3.nabble.com/Status-o...</a><p>[8] <a href="http://www.artima.com/scalazine/articles/steps.html" rel="nofollow">http://www.artima.com/scalazine/articles/steps.html</a><p>[9] <a href="http://spark.apache.org/docs/latest/quick-start.html" rel="nofollow">http://spark.apache.org/docs/latest/quick-start.html</a><p>[10] <a href="http://ampcamp.berkeley.edu/4/exercises/" rel="nofollow">http://ampcamp.berkeley.edu/4/exercises/</a><p>[11] <a href="https://spark.apache.org/community.html" rel="nofollow">https://spark.apache.org/community.html</a>
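For a bit of intuition about what MLlib does under the hood: its early linear methods (logistic regression, linear SVMs) are trained with stochastic gradient descent. Here's a tiny pure-Python sketch of that idea on made-up 1-D data — no Spark involved, and the data, learning rate, and iteration count are all arbitrary choices for illustration; MLlib's contribution is distributing this kind of loop across a cluster.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up, linearly separable 1-D data: label 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

# Stochastic gradient descent on logistic loss -- the core computation
# behind MLlib's SGD-based logistic regression, minus the distribution.
w, b = 0.0, 0.0
random.seed(0)
for _ in range(1000):
    x, y = random.choice(data)           # sample one point per step
    pred = sigmoid(w * x + b)
    grad = pred - y                      # d(log-loss)/d(logit)
    w -= 0.5 * grad * x                  # learning rate 0.5, arbitrary
    b -= 0.5 * grad

# On separable data the learned model should fit the training points.
correct = sum((sigmoid(w * x + b) > 0.5) == (y == 1) for x, y in data)
print(correct, "of", len(data), "classified correctly")
```

The single-machine version is trivial; the hard part Spark/MLlib solves is running the gradient computation over data too large for one node.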
Note that Spark 1.0.0 makes it possible to trivially submit Spark jobs to an existing Hadoop cluster.<p>It leverages HDFS to distribute archives (e.g. your app JAR) and store results / state / logs, and YARN to schedule itself and acquire compute resources.<p>It's pretty amazing to see how you use Spark's API to write functional applications that are then distributed across multiple executors (e.g. when you use Spark's "filter" or "map" operations, the work potentially gets partitioned and executed on totally different nodes).<p>Great tool, and exciting to see it reach 1.0.0!
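To make that functional style concrete: Spark's transformations have the same shape as ordinary collection operations. Here's a minimal plain-Python sketch of a filter-then-map pipeline — no Spark, no cluster, and the log lines are invented for illustration. In real Spark code the same chain would be written against an RDD, and each stage could run on different executors.

```python
# Toy stand-in for an RDD pipeline: the same filter/map shape that
# Spark would partition across executors, run here on a plain list.
log_lines = [
    "INFO starting job",
    "ERROR disk full",
    "INFO shuffling",
    "ERROR network timeout",
]

# filter: keep only error lines (in Spark: rdd.filter(...))
errors = filter(lambda line: line.startswith("ERROR"), log_lines)

# map: strip the log level, keep the message (in Spark: rdd.map(...))
messages = list(map(lambda line: line.split(" ", 1)[1], errors))

print(messages)  # -> ['disk full', 'network timeout']
```

Because the per-element functions are pure, Spark is free to ship them to whichever nodes hold each partition of the data, which is exactly why the API can stay this simple while the execution is distributed.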
I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is generating a lot of excitement in the big data community:<p><a href="https://speakerdeck.com/stevendborrelli/introduction-to-apache-spark" rel="nofollow">https://speakerdeck.com/stevendborrelli/introduction-to-apac...</a>
I wonder if anyone with experience with Spark can comment / rebut this post: <a href="http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html" rel="nofollow">http://blog.explainmydata.com/2014/05/spark-should-be-better...</a>
Spark is an interesting technology, but from what I've heard it doesn't actually have much traction in industry yet.<p>Is anyone here actually using it in production? I know it's blazing fast, and I like it as a MapReduce replacement. It has all the makings of a great distributed system; I'm just still waiting to see a major deployment.
Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic MapReduce pretty easily; check out our tutorial project to see how: <a href="https://github.com/snowplow/spark-example-project" rel="nofollow">https://github.com/snowplow/spark-example-project</a>