TechEcho
Apache Spark 1.0.0

157 points by steveb almost 11 years ago

8 comments

MoOmer almost 11 years ago
For new entrants, here's an email I sent out to some colleagues of mine just getting into ML. I'm wrapping up a project that's using Mahout, and am getting into Spark & MLlib now. I've regurgitated this on reddit already.

I've been following Apache Spark [0], a new-ish Apache project created by UC Berkeley to replace Hadoop MapReduce [1], for about a month now, and I finally got around to spending some time with it last night and earllllllly this morning.

A strong machine learning library, MLlib [2], was added to the Spark mix about a year ago. It is similar to Mahout [3] and promises much better performance (comparable to or better than Matlab [4] and Vowpal Wabbit [5]).

MLlib is a lower-level library, which offers a lot of control/power for developers. However, Berkeley's AMPLab has also created a higher-level abstraction layer for end users called MLI [6]. It's still being actively developed, and although updates are in the works, they haven't been pushed to the public repository for a while [7].

Check out an introduction to MLlib on YouTube here: https://www.youtube.com/watch?v=IxDnF_X4M-8

Getting up to speed with Spark itself is really pain-free compared to some tools like Mahout. There's a quick-start guide for Scala [8], a getting started guide for Spark [9], and lots of other learning/community resources available for Spark [10] [11].

[0] http://spark.apache.org/
[1] http://hadoop.apache.org/
[2] http://spark.apache.org/mllib/
[3] https://mahout.apache.org/
[4] http://www.mathworks.com/products/matlab/
[5] https://github.com/JohnLangford/vowpal_wabbit/wiki
[6] http://www.mlbase.org/
[7] http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-MLI-td3610.html
[8] www.artima.com/scalazine/articles/steps.html
[9] http://spark.apache.org/docs/latest/quick-start.html
[10] http://ampcamp.berkeley.edu/4/exercises/
[11] https://spark.apache.org/community.html
krallin almost 11 years ago
Note that Spark 1.0.0 makes it possible to trivially submit Spark jobs to an existing Hadoop cluster.

It leverages HDFS to distribute archives (e.g. your app JAR) and store results / state / logs, and YARN to schedule itself and acquire compute resources.

It's pretty amazing to see how you use Spark's API to write functional applications that are then distributed across multiple executors (e.g. when you use Spark's "filter" or "map" operations, the operation potentially gets split up and executed on totally different nodes).

Great tool, and exciting to see it reach 1.0.0!
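
To make the "filter" / "map" point above a bit more concrete, here is a toy sketch in plain Python. It does not use Spark at all: `ToyDataset`, `split_into_partitions`, and the thread pool standing in for executors are hypothetical names invented for this illustration, and a real cluster would run each partition on a separate node rather than a local thread.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal, order-preserving chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


class ToyDataset:
    """Hypothetical stand-in for a distributed collection (not Spark's API)."""

    def __init__(self, partitions, stages=None):
        self.partitions = partitions
        self.stages = stages or []  # transformations recorded, not yet run

    def map(self, fn):
        # Record a per-partition transformation; nothing executes here.
        return self._with_stage(lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return self._with_stage(lambda part: [x for x in part if pred(x)])

    def _with_stage(self, stage):
        return ToyDataset(self.partitions, self.stages + [stage])

    def _run_partition(self, part):
        # Each "executor" applies every recorded stage to its own partition.
        for stage in self.stages:
            part = stage(part)
        return part

    def collect(self):
        # Threads play the role of executors; results come back in order.
        workers = max(1, len(self.partitions))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(self._run_partition, self.partitions))
        return [x for part in results for x in part]


numbers = ToyDataset(split_into_partitions(list(range(10)), 3))
result = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The application code only chains `filter` and `map`; how the work is split up and where each piece runs is decided when `collect` is called, which is roughly the separation the comment describes.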
steveb almost 11 years ago
I gave a 30-minute overview of Spark yesterday at StampedeCon. Spark is generating a lot of excitement in the big data community:

https://speakerdeck.com/stevendborrelli/introduction-to-apache-spark
eranation almost 11 years ago
I wonder if anyone with experience with Spark can comment on / rebut this post: http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html
agibsonccc almost 11 years ago
Spark is an interesting technology, but from what I've heard it doesn't actually have traction in industry yet.

Anyone here actually using it in production? I know it's blazing fast etc., and I like it as a MapReduce replacement. It has all the makings of a great distributed system, but I'm still waiting to see a major deployment.
kovrik almost 11 years ago
Any active Clojure bindings?

clj-spark seems to be abandoned (last commit was a year ago)...
alexatkeplar almost 11 years ago
Great to see Spark hitting 1.0.0. You can actually run Spark on Elastic MapReduce pretty easily - check out our tutorial project for how: https://github.com/snowplow/spark-example-project
Nasiruddin almost 11 years ago
Great...new era of distributed computing