TechEcho

It's not clear how much of this improvement comes from lower pricing rather than from algorithms and design decisions. They've documented things like using Netty to cut network latency, avoiding GC, and getting better with Spark, but it would be interesting if the team could rerun the benchmark on the same infrastructure as their 2014 run. A code-vs-code comparison like that would separate engineering improvements from economies of scale.

Meanwhile, Google was sorting petabytes in under a minute on their clusters 6+ years ago. We've still got a long way to go in OSS land to compete with the big boys.

A price record, not a performance one.

Also, seeing how expensive it is to sort 100TB ($144), you have to wonder why it wouldn't be better to do it on your own hardware.

I got excited, and then I saw that this was for sorting, not storage...

Setting a new world record in CloudSort with Apache Spark

4 comments
