Joining a billion rows 20x faster than Apache Spark

153 points by plamb, over 8 years ago

10 comments

banachtarski, over 8 years ago
Am I reading this correctly? The testbed was a single laptop? A big part of Spark is the distributed in-memory aspect, so I'm not sure I understand why any of these numbers mean anything.
filereaper, over 8 years ago
I apologize in advance, but whenever people claim to use an in-memory big-data system, how exactly does this end up working?

You can only stuff so much into memory, so you can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, you scale out horizontally. And with each of these in-memory appliances, what happens when you need to spill out to disk?

In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?
usgroup, over 8 years ago
Lol, I was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.
EGreg, over 8 years ago
This seems like impressive stats about a relational database technology. But the scrolling on their website doesn't work on mobile. So in grand HN tradition, I left and now tell you all about it here, instead of the main point of their invention :)
Loic, over 8 years ago
What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?
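
The article doesn't say which strategy it uses, but for readers weighing the two options in the question, here is a minimal Python sketch of the contrast (toy data; the column names `id` and `k` follow the question, nothing here is the benchmark's actual code):

    # Toy contrast between the two join strategies the question names.
    # `left` and `right` are lists of (key, value) pairs.

    def hash_join(left, right):
        # Build a hash table on the smaller (right) side, then probe it.
        table = {}
        for k, v in right:
            table.setdefault(k, []).append(v)
        return [(i, lv, rv) for i, lv in left for rv in table.get(i, [])]

    def merge_join(left, right):
        # Assumes both sides are sorted on the join key and walks them
        # in lockstep; a galloping variant would binary-search ahead
        # instead of advancing one element at a time.
        out, j = [], 0
        for i, lv in left:
            while j < len(right) and right[j][0] < i:
                j += 1
            m = j
            while m < len(right) and right[m][0] == i:
                out.append((i, lv, right[m][1]))
                m += 1
        return out

    left = [(1, 'a'), (2, 'b'), (3, 'c')]
    right = [(1, 'x'), (3, 'y')]
    assert hash_join(left, right) == merge_join(left, right)

Both produce the same rows; at a billion rows the interesting difference is memory traffic, which is presumably what the question is getting at.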
alexchamberlain, over 8 years ago
Python 2.7 can do it in 0.0867 usec (Intel i7):

    $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
    10000000 loops, best of 3: 0.0867 usec per loop

(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)
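
The trick in that one-liner is the closed form sum(1..n) = n*(n+1)/2, so the billion-element range never has to be materialized. A quick sanity check at a size where the naive sum still finishes (my addition, written for Python 3, where // keeps the division exact):

    # The closed form matches the naive sum; 10**6 instead of 10**9
    # so the right-hand side finishes quickly.
    n = 10**6
    assert (n * n + n) // 2 == sum(range(1, n + 1))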
marknadal, over 8 years ago
Great article, actually. Typical HN comments on performance optimizations are complaints like "this isn't a real world use case" or things like that. Most of them miss that comparing baseline performance metrics between two systems is still genuinely interesting in and of itself, and acts as a huge learning catalyst for understanding what is going on. I think this article did a great job of making an honest comparison and discussing what is going on, so props to the team! (We did something similar as well, where we compared cached read performance against Redis and were 50x faster: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2GB-Atom-CPU )
Bedon292, over 8 years ago
I know it's just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually save me time on real-world use cases?
supergirl, over 8 years ago
Why would you choose values between 1 and 1000 for the right side? Why not 1000 values between 1 and 1 billion?
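
One possible reading of why that choice matters (an assumption about the benchmark's shape, not something the article confirms): with only 1,000 distinct keys on the right side, the probe table fits comfortably in cache and every left row finds a match, which is a much friendlier workload than a billion-key right side.

    # Hypothetical sketch of the benchmark's shape as described in
    # the question; key ranges are assumptions, not the article's code.
    import random

    small_right = {k: k * 2 for k in range(1, 1001)}           # 1,000 keys
    left_keys = [random.randint(1, 1000) for _ in range(10**6)]

    # Every probe hits and the dict fits in cache; drawing right-side
    # keys from 1..10**9 instead would make most probes miss and the
    # table far larger than cache.
    assert all(k in small_right for k in left_keys)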
zzleeper, over 8 years ago
In case the author reads this: I can't read well with that font unless I zoom in all the way. Doesn't happen with anything else (Win10, 14-inch laptop, Chrome).