科技回声

8 comments

eranation, nearly 10 years ago

Anyone who wants to pick up Spark basics: Berkeley (Spark was developed at Berkeley's AMPLab), in collaboration with Databricks (the commercial company started by Spark's creators), just started a free MOOC on edX: https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x

(If you wonder what Spark is, in a very unofficial nutshell: it is a computation / big data / analytics / machine learning / graph processing engine on top of Hadoop that usually performs much better and has an arguably much easier API in Python, Scala, Java and now R.)

It has more than 5000 students so far, and the professor seems to answer every single question on Piazza (a popular student / teacher message board).

So far it looks really good. (It started a week ago, so you can still catch up; the 2nd lab is only due Friday 6/12 EOD, but you have a 3-day "grace" period... and there is not too much to catch up on.)

I use Spark for work (Scala API) and still learned one or two new things.

It uses the PySpark API, so no need to learn Scala. All homework labs are done in an IPython notebook. Very high quality so far IMHO.

It is followed by a more advanced Spark course (Scalable Machine Learning), also by Berkeley & Databricks: https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x

(Not affiliated with edX, Berkeley or Databricks; just thought it's a good place for a PSA to those interested.)

The academic paper that originated Spark, by Matei Zaharia (creator of Spark), won him a PhD dissertation award from the ACM in 2014 (http://www.acm.org/press-room/news-releases/2015/dissertation-award-14/).

Spark also set a new record in large-scale sorting (beating Hadoop by far): https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

* EDIT: typo in "Berkeley", thanks gboss for noticing :)

fleeno, nearly 10 years ago

As someone who doesn't know what Apache Spark is, this article reads like it could have been randomly generated.

chiachun, nearly 10 years ago

The release notes: https://spark.apache.org/releases/spark-release-1-4-0.html

Another major change is that it now supports Python 3: https://issues.apache.org/jira/browse/SPARK-4897

minimaxir, nearly 10 years ago

I'm excited about SparkR, even though R is shunned in the field of big data. Between that and dplyr (which inspired the SparkR syntax) for data manipulation and sanitization, it should be much easier to write sane, reproducible code and visualizations for big data analysis. (The Python/Scala tutorials for Spark gave me a headache.)

SparkR appears to have strong integration with RStudio, which is big news: http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent-warmerdam/

DannoHung, nearly 10 years ago

Does anyone know if there's a guide to integrating Spark between a real-time, write-only database and a historical database?

I've looked into using Spark Streaming, but I can't work out how you could seamlessly transition data from a streaming batch to the historical DB in a reasonably tight time period.

I'd be willing to pay for training if it came to it, but I don't think I'm using the right search terms.

krat0sprakhar, nearly 10 years ago

Somehow, I always get confused between Spark and Storm! Can someone explain the difference between the two (use cases etc.) as if explaining to a five-year-old? Thanks!

lazzlazzlazz, nearly 10 years ago

Is support for User Defined Aggregation Functions (regarding DataFrames) slated for 1.5?

Tepix, nearly 10 years ago

Too bad the website is so hard to read. Time for that site to join contrastrebellion.com

Announcing Apache Spark 1.4
