Joining a billion rows 20x faster than Apache Spark

153 points by plamb, over 8 years ago

10 comments

banachtarski, over 8 years ago
Am I reading this correctly? The testbed was a single laptop? A big part of Spark is the distributed in-memory aspect, so I'm not sure I understand why any of these numbers mean anything.
filereaper, over 8 years ago
I apologize in advance, but whenever people claim to use an in-memory big-data system, how exactly does this end up working?

You can only stuff so much into memory, so you can scale up vertically in terms of memory, but unless you buy a massive big-iron POWER box, you scale out horizontally. And with each of these in-memory appliances, what happens when you need to spill out to disk?

In essence, why should one bother with these in-memory appliances as opposed to buying boxes with fast SSDs instead? Sure, you spill out to disk, but do you take that big of a hit compared to the enormous cost of keeping everything in memory?
usgroup, over 8 years ago
Lol, I was hoping it was a combination of awk and paste :)

That always makes me chuckle.

Honestly though... Jenkins + bash + cloud storage and you'll be surprised at how many big data problems you can solve with a fraction of the complexity.
EGreg, over 8 years ago
This seems like impressive stats about a relational database technology. But the scrolling on their website doesn't work on mobile. So in grand HN tradition, I left and now tell you all about it here, instead of the main point of their invention :)
Loic, over 8 years ago
What is the algorithm used to join the tables? Is it a hash join on `id` and `k` or using the fact that the ids are sorted and using a kind of galloping approach?
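
The article doesn't say which strategy it uses, but for readers weighing the two options in the question, here is a minimal Python sketch of the contrast (toy data; the column names `id` and `k` follow the question, nothing here is the benchmark's actual code):

    # Toy contrast between the two join strategies the question names.
    # `left` and `right` are lists of (key, value) pairs.

    def hash_join(left, right):
        # Build a hash table on the smaller (right) side, then probe it.
        table = {}
        for k, v in right:
            table.setdefault(k, []).append(v)
        return [(i, lv, rv) for i, lv in left for rv in table.get(i, [])]

    def merge_join(left, right):
        # Assumes both sides are sorted on the join key and walks them
        # in lockstep; a galloping variant would binary-search ahead
        # instead of advancing one element at a time.
        out, j = [], 0
        for i, lv in left:
            while j < len(right) and right[j][0] < i:
                j += 1
            m = j
            while m < len(right) and right[m][0] == i:
                out.append((i, lv, right[m][1]))
                m += 1
        return out

    left = [(1, 'a'), (2, 'b'), (3, 'c')]
    right = [(1, 'x'), (3, 'y')]
    assert hash_join(left, right) == merge_join(left, right)

Both produce the same rows; at a billion rows the interesting difference is memory traffic, which is presumably what the question is getting at.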
alexchamberlain, over 8 years ago
Python 2.7 can do it in 0.0867 usec (Intel i7):

    $ python2.7 -m timeit 'n=10**9; (n*n + n) / 2'
    10000000 loops, best of 3: 0.0867 usec per loop

(Admittedly, I killed `n=10**9; sum(range(1,n+1))`.)
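
The trick in that one-liner is the closed form sum(1..n) = n*(n+1)/2, so the billion-element range never has to be materialized. A quick sanity check at a size where the naive sum still finishes (my addition, written for Python 3, where // keeps the division exact):

    # The closed form matches the naive sum; 10**6 instead of 10**9
    # so the right-hand side finishes quickly.
    n = 10**6
    assert (n * n + n) // 2 == sum(range(1, n + 1))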
marknadal, over 8 years ago
Great article, actually. Typical HN comments on performance optimizations are complaints like "this isn't a real world use case" or things like that. Most of them miss that comparing baseline performance metrics between two systems is still genuinely interesting in and of itself, and acts as a huge learning catalyst for understanding what is going on. I think this article did a great job of making an honest comparison and discussing what is going on, so props to the team! (We did something similar as well, where we compared cached read performance against Redis and were 50x faster: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2GB-Atom-CPU )
Bedon292, over 8 years ago
I know it's just a benchmark for comparison, and it is awesome. I love seeing cool comparisons like this, but why do I care that this particular benchmark is faster than Spark? What sort of analytics will be affected by this improvement, and will it actually save me time on real-world use cases?
supergirl, over 8 years ago
Why would you choose values between 1 and 1000 for the right side? Why not 1000 values between 1 and 1 billion?
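
One possible reading of why that choice matters (an assumption about the benchmark's shape, not something the article confirms): with only 1,000 distinct keys on the right side, the probe table fits comfortably in cache and every left row finds a match, which is a much friendlier workload than a billion-key right side.

    # Hypothetical sketch of the benchmark's shape as described in
    # the question; key ranges are assumptions, not the article's code.
    import random

    small_right = {k: k * 2 for k in range(1, 1001)}           # 1,000 keys
    left_keys = [random.randint(1, 1000) for _ in range(10**6)]

    # Every probe hits and the dict fits in cache; drawing right-side
    # keys from 1..10**9 instead would make most probes miss and the
    # table far larger than cache.
    assert all(k in small_right for k in left_keys)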
zzleeper, over 8 years ago
In case the author reads this: I can't read well with that font unless I zoom in all the way. Doesn't happen with anything else (Win10, 14-inch laptop, Chrome).