If you are interested in the code behind this, I wrote an overview last month covering the functionality, with links to the code that backs the improvements they discuss: http://hydronitrogen.com/spark-220-cost-based-optimizer-explained.html

There's a fair amount of overlap, but where the Databricks article explains the techniques with charts and high-level explanations, I go over the code instead.
On this topic, I really like the Join Order Benchmark paper: http://www.vldb.org/pvldb/vol9/p204-leis.pdf

It shows that most cost-based optimizers are quite bad at cardinality estimation, and that those estimation errors compound as queries involve more joins.
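To make the compounding point concrete, here's a toy illustration (my own numbers, not from the paper): intermediate result sizes multiply through a join chain, so a constant per-join estimation error grows roughly exponentially with the number of joins.

  // toy sketch: a constant 2x per-join cardinality error compounds multiplicatively
  val perJoinError = 2.0
  (1 to 8).foreach { joins =>
    println(f"$joins%d joins -> estimate off by ~${math.pow(perJoinError, joins)}%.0fx")
  }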
What's cool about these statistics-based approaches is that you mostly don't need fully up-to-date statistics, just reasonably accurate ones, unless your data churns very heavily. That means you can get query speedups without paying any overhead at insertion time: you choose when to take that overhead by running ANALYZE.

Very neat stuff from the Databricks team!
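For anyone who hasn't tried it, here's roughly what that looks like in Spark 2.2 from a spark-shell or SparkSession (the table and column names are just placeholders):

  // collect table-level and column-level statistics on demand
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
  spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")

  // the cost-based optimizer is off by default in 2.2, so enable it explicitly
  spark.conf.set("spark.sql.cbo.enabled", "true")
  spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

The stats land in the metastore, so the cost you pay for ANALYZE is decoupled from your write path entirely.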