
Announcing Spark 1.6

104 points by rxin over 9 years ago

9 comments

mb22 over 9 years ago
We've been testing 1.6 since before release, specifically SparkSQL and there are some big performance improvements in this release. We're putting together a 3rd party benchmark I'll post to HN when we are done.
minimaxir over 9 years ago
It sounds like many of the improvements are not available on PySpark yet, which is disappointing. (The notes say feature parity for MLlib, so I'll look into that.) Still, the release notes sound promising.
kod over 9 years ago
So the detailed post on datasets at https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html uses groupBy.

I'm pretty sure, based on previous comments you've made, that groupBy was one of the things you'd rather eliminate from the RDD API, because of the performance impact compared to reduceByKey (which is almost always what people should be using instead).

Are you at all worried about confusion if groupBy now performs OK on Datasets, but not on RDDs?
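The performance gap the comment above refers to comes from map-side combining: reduceByKey merges values per key on each partition before the shuffle, while groupByKey ships every raw record across the network. A minimal pure-Python sketch of that idea (not Spark code; function and variable names are illustrative):

```python
# Pure-Python sketch of why reduceByKey beats groupByKey on RDDs:
# combining per partition first means only one partial value per key
# per partition crosses the "shuffle" boundary.
from collections import defaultdict


def reduce_by_key(partitions, combine):
    # Map-side combine: one partial result per key per partition.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        partials.append(local)
    # "Shuffle" stage: merge the small per-partition maps.
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def group_by_key(partitions):
    # No map-side combine: every raw record crosses the shuffle boundary.
    grouped = defaultdict(list)
    for part in partitions:
        for key, value in part:
            grouped[key].append(value)
    return dict(grouped)


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
print(reduce_by_key(partitions, lambda x, y: x + y))  # {'a': 9, 'b': 6}
```

Here the shuffle for `reduce_by_key` moves two entries per partition at most, versus five raw records for `group_by_key`; on a real cluster that difference is network traffic, which is why groupBy on RDDs earned its reputation.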
_laf over 9 years ago
This post: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html

In the section titled "Future Directions for Spark Streaming" there is a paragraph about *event time and out-of-order data* and *backpressure*. This would blow my mind to be able to use; this is a real pain currently.
mziel over 9 years ago
New statistics/machine learning algorithms are always welcome, but the big plus for productionizing is ML pipeline persistence.

That said, during Spark Summit the Databricks folks themselves were most excited about the Dataset API. Looking forward to giving it a try.
mark_l_watson over 9 years ago
Great news, Spark is awesome. Only problem is that I now need to review my Spark material for an eBook that I released a month ago and update the examples to work on version 1.6, if required.
DannoHung over 9 years ago
Any support for nearest neighbor joins? They are very important for aligning events in time series data sets.
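The join being asked about here is sometimes called an "as-of" or nearest-neighbor join: for each event in one series, find the record with the closest timestamp in the other. The core lookup can be sketched in plain Python with binary search (illustrative names and data; not a Spark API, which did not ship such a join in this release):

```python
# Nearest-neighbor ("as-of") join sketch: for each left event, find the
# right-hand record whose timestamp is closest. Both inputs must be
# sorted by timestamp for the binary search to work.
import bisect


def nearest_join(left, right):
    """left, right: timestamp-sorted lists of (timestamp, value) pairs.
    Returns a list of (left_event, nearest_right_event) pairs."""
    right_ts = [t for t, _ in right]
    out = []
    for t, v in left:
        i = bisect.bisect_left(right_ts, t)
        # Candidates: the neighbor at/after t and the one just before it.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right)]
        j = min(candidates, key=lambda k: abs(right_ts[k] - t))
        out.append(((t, v), right[j]))
    return out


left = [(1.0, "trade"), (5.0, "trade")]
right = [(0.9, "quoteA"), (4.0, "quoteB"), (5.2, "quoteC")]
print(nearest_join(left, right))
# [((1.0, 'trade'), (0.9, 'quoteA')), ((5.0, 'trade'), (5.2, 'quoteC'))]
```

Each lookup is O(log n), so the whole join is O(m log n); distributing it (e.g. across Spark partitions) additionally requires range-partitioning both series by time so matching neighbors land on the same partition, which is the hard part the commenter is asking about.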
huula over 9 years ago
Love Spark! Great work, folks!
ranjeet_hacker over 9 years ago
Excited about the Dataset API and ML pipelines.