To put this into context, I recommend reading 'MonetDB/X100: Hyper-Pipelining Query Execution' [0]. Vectorized execution has been something of an open secret in the database industry for quite some time now.<p>For me, reading about the Spark achievements is particularly interesting. I was part of the similar Hive effort (the Stinger initiative [1]) and contributed some parts of the Hive vectorized execution [2]. I see that the same solution that was applied to Hive is now being applied to Spark:<p>- move to a columnar, highly compressed storage format (Parquet; for Hive it was ORC)<p>- implement a vectorized execution engine<p>- code generation instead of plan interpretation. This is particularly interesting to me because for Hive this was discussed at the time and actually <i>not</i> adopted (ORC and vectorized execution had, justifiably, higher priority).<p>The numbers presented in the OP look very nice. Aggregates, Filters, Sort and Scan ('decoding') show big improvements (I would have expected these; this is exactly what vectorized execution is best at). I like that Hash-Join also shows a significant improvement; their implementation is obviously better than the HIVE-4850 I did, of which I'm not too proud. The SM/SMB join is not affected, no surprise there.<p>I would like to see a breakdown of how much of the improvement comes from vectorization vs. how much from code generation. I get the feeling that, the way they did it, these cannot be separated. I think there are no vectorized plans/operators to compare against the code generation; they implemented both simultaneously. I'm speculating, but I guess the new whole-stage code generation generates vectorized code, so there is no vectorized execution without code generation.<p>All in all, congrats to the Databricks team. This will have a big impact.<p>[0] <a href="http://oai.cwi.nl/oai/asset/16497/16497B.pdf" rel="nofollow">http://oai.cwi.nl/oai/asset/16497/16497B.pdf</a>
[1] <a href="http://hortonworks.com/blog/100x-faster-hive/" rel="nofollow">http://hortonworks.com/blog/100x-faster-hive/</a>
[2] <a href="https://issues.apache.org/jira/browse/HIVE-4160" rel="nofollow">https://issues.apache.org/jira/browse/HIVE-4160</a>