TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Spark as a Compiler: Joining a Billion Rows per Second on a Laptop

251 点作者 rxin将近 9 年前

7 条评论

rusanu将近 9 年前
To put this into context I would recommend reading &#x27;MonetDB&#x2F;X100: Hyper-Pipelining Query Execution&#x27; [0]. Vectorized execution has been sort of an open secret in database industry for quite some time now.<p>For me, is particularly interesting reading the Spark achievements. I was part of the similar Hive effort (the Stinger initiative [1]) and I contributed some parts of the Hive vectorized execution [2]. I see the same solution that applied to Hive now applies to Spark:<p>- move to a columnar, highly compressed storage format (Parquet, for Hive it was ORC)<p>- implement a vectorized execution engine<p>- code generation instead of plan interpretation. This is particularly interesting for me because for Hive this was discussed then and actually <i>not</i> adopted (ORC and vectorized execution had, justifiably, bigger priority).<p>Looking at the numbers presented in OP, it looks very nice. Aggregates, Filters, Sort, Scan (&#x27;decoding&#x27;) show big improvement (I would expected these, is exactly what vectorized execution is best at). I like that Hash-Join also shows significant improvement, is obvious their implementation is better than the HIVE-4850 I did, of which I&#x27;m not too proud. The SM&#x2F;SMB join is not affected, no surprise there.<p>I would like to see a separation of how much of the improvement comes from vectorization vs. how much from code generation. I get the feeling that the way they did it these cannot be separated. I think there is no vectorized plan&#x2F;operators to compare against the code generation, they implemented both simultaneously. I&#x27;m speculating, but I guess the new whole-stage code generation it generates vectorized code, so there is no vectorized execution w&#x2F;o code generation.<p>All in all, congrats to the DataBricks team. This will have a big impact.<p>[0] <a href="http:&#x2F;&#x2F;oai.cwi.nl&#x2F;oai&#x2F;asset&#x2F;16497&#x2F;16497B.pdf" rel="nofollow">http:&#x2F;&#x2F;oai.cwi.nl&#x2F;oai&#x2F;asset&#x2F;16497&#x2F;16497B.pdf</a> [1] <a href="http:&#x2F;&#x2F;hortonworks.com&#x2F;blog&#x2F;100x-faster-hive&#x2F;" rel="nofollow">http:&#x2F;&#x2F;hortonworks.com&#x2F;blog&#x2F;100x-faster-hive&#x2F;</a> [2] <a href="https:&#x2F;&#x2F;issues.apache.org&#x2F;jira&#x2F;browse&#x2F;HIVE-4160" rel="nofollow">https:&#x2F;&#x2F;issues.apache.org&#x2F;jira&#x2F;browse&#x2F;HIVE-4160</a>
评论 #11757383 未加载
评论 #11756498 未加载
评论 #11756416 未加载
eggy将近 9 年前
&gt;&gt; This style of processing, invented by columnar database systems such as MonetDB and C-Store<p>True, since MonetDB came of out of the Netherlands around 1993, but KDB+ came out in 1998, well over 8 years before C-Store.<p>K4 the language used in KDB is 230 times faster than Spark&#x2F;shark and uses 0.2GB of RAM vs. 50GB of RAM for Spark&#x2F;shark, yet no mention in the article. It seems a strange omission for such a sensational sounding title [1]. I don&#x27;t understand why big data startups don&#x27;t try and remake the success of KDB instead of reinventing bits and pieces of the same tech with a result in slower DB operations and more RAM usage.<p>[1] <a href="http:&#x2F;&#x2F;kparc.com&#x2F;q4&#x2F;readme.txt" rel="nofollow">http:&#x2F;&#x2F;kparc.com&#x2F;q4&#x2F;readme.txt</a>
评论 #11759108 未加载
评论 #11762075 未加载
mastratton3将近 9 年前
I&#x27;ve been personally very impressed with Spark&#x27;s RDD api for easily parallelizing tasks that are &quot;embarrassingly parallel&quot;. However, I have found the data frames API to not always work as advertised and thus I&#x27;m very skeptical of the benchmarks.<p>I think a prime example of this is I was using some very basic windowing functions and due to the data shuffling (The data wasn&#x27;t naturally partitioned) it seemed to be very buggy and not very clear why stuff was failing. I ended up rewriting the same section of code using hive and it had both better performance and didn&#x27;t seem to have any odd failures. I realize this stuff will improve but I&#x27;m still skeptical.
评论 #11755547 未加载
评论 #11756033 未加载
dswalter将近 9 年前
It&#x27;s interesting to see that as further work is done on spark (and I&#x27;m pleased they&#x27;re actually improving the system), it behaves more and more like a database.
foota将近 9 年前
Reading about databases always makes me sad about the enterprise database I work with.
评论 #11755516 未加载
评论 #11755418 未加载
评论 #11756197 未加载
评论 #11755776 未加载
jkot将近 9 年前
Typesafe plans to add support for vectorization into Scala compiler for Akka. Also JVM 8 JIT compiler does it to some extend already.
评论 #11758485 未加载
评论 #11757251 未加载
faizshah将近 9 年前
I&#x27;ll have to benchmark Spark 2.0 against Flink, it seems like it could be faster than Flink now. It does depend on if the dataset they used was living in memory before they ran the benchmark, but it&#x27;s still some pretty impressive numbers and some of the optimizations they made with Tungsten sound similar to what Flink was doing.
评论 #11756037 未加载