
Announcing Spark 1.6

104 points by rxin over 9 years ago

9 comments

mb22 over 9 years ago
We've been testing 1.6 since before release, specifically SparkSQL and there are some big performance improvements in this release. We're putting together a 3rd party benchmark I'll post to HN when we are done.
minimaxir over 9 years ago
It sounds like many of the improvements are not available on PySpark yet, which is disappointing. (The notes say feature parity for MLlib, so I'll look into that.) Still, the release notes sound promising.
kod over 9 years ago
So the detailed post on datasets at https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html uses groupBy.

I'm pretty sure, based on previous comments you've made, that groupBy was one of the things you'd rather eliminate from the RDD API, because of the performance impact compared to reduceByKey (which is almost always what people should be using instead).

Are you at all worried about confusion if groupBy now performs OK on Datasets, but not on RDDs?
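The performance gap the comment above refers to comes from map-side combining: reduceByKey merges values per key on each partition before the shuffle, while groupByKey ships every raw record across the network. A minimal pure-Python sketch of that idea (not Spark code; function and variable names are illustrative):

```python
# Pure-Python sketch of why reduceByKey beats groupByKey on RDDs:
# combining per partition first means only one partial value per key
# per partition crosses the "shuffle" boundary.
from collections import defaultdict


def reduce_by_key(partitions, combine):
    # Map-side combine: one partial result per key per partition.
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        partials.append(local)
    # "Shuffle" stage: merge the small per-partition maps.
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def group_by_key(partitions):
    # No map-side combine: every raw record crosses the shuffle boundary.
    grouped = defaultdict(list)
    for part in partitions:
        for key, value in part:
            grouped[key].append(value)
    return dict(grouped)


partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("a", 5)]]
print(reduce_by_key(partitions, lambda x, y: x + y))  # {'a': 9, 'b': 6}
```

Here the shuffle for `reduce_by_key` moves two entries per partition at most, versus five raw records for `group_by_key`; on a real cluster that difference is network traffic, which is why groupBy on RDDs earned its reputation.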
_laf over 9 years ago
This post: https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html

In the section titled "Future Directions for Spark Streaming" there is a paragraph about *event time and out-of-order data* and *backpressure*. This would blow my mind to be able to use; this is a real pain currently.
mziel over 9 years ago
New statistics/machine learning algorithms are always welcome, but the big plus for productionizing is ML pipeline persistence.

That said, during Spark Summit the Databricks folks themselves were most excited about the Dataset API. Looking forward to giving it a try.
mark_l_watson over 9 years ago
Great news, Spark is awesome. Only problem is that I now need to review my Spark material for an eBook that I released a month ago and update the examples to work on version 1.6, if required.
DannoHung over 9 years ago
Any support for nearest neighbor joins? They are very important for aligning events in time series data sets.
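The join being asked about here is sometimes called an "as-of" or nearest-neighbor join: for each event in one series, find the record with the closest timestamp in the other. The core lookup can be sketched in plain Python with binary search (illustrative names and data; not a Spark API, which did not ship such a join in this release):

```python
# Nearest-neighbor ("as-of") join sketch: for each left event, find the
# right-hand record whose timestamp is closest. Both inputs must be
# sorted by timestamp for the binary search to work.
import bisect


def nearest_join(left, right):
    """left, right: timestamp-sorted lists of (timestamp, value) pairs.
    Returns a list of (left_event, nearest_right_event) pairs."""
    right_ts = [t for t, _ in right]
    out = []
    for t, v in left:
        i = bisect.bisect_left(right_ts, t)
        # Candidates: the neighbor at/after t and the one just before it.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(right)]
        j = min(candidates, key=lambda k: abs(right_ts[k] - t))
        out.append(((t, v), right[j]))
    return out


left = [(1.0, "trade"), (5.0, "trade")]
right = [(0.9, "quoteA"), (4.0, "quoteB"), (5.2, "quoteC")]
print(nearest_join(left, right))
# [((1.0, 'trade'), (0.9, 'quoteA')), ((5.0, 'trade'), (5.2, 'quoteC'))]
```

Each lookup is O(log n), so the whole join is O(m log n); distributing it (e.g. across Spark partitions) additionally requires range-partitioning both series by time so matching neighbors land on the same partition, which is the hard part the commenter is asking about.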
huula over 9 years ago
Love Spark! Great work, folks!
ranjeet_hacker over 9 years ago
Excited about the Dataset API and ML pipelines.