Cost-Based Optimizer in Apache Spark 2.2

98 points by dmatrix, almost 8 years ago

4 comments

hkothari, almost 8 years ago
If you are interested in the code behind this, I wrote an overview last month of the functionality, with links to the code that backs the improvements they talk about: http://hydronitrogen.com/spark-220-cost-based-optimizer-explained.html

There's a fair amount of overlap, but where the Databricks article explains the techniques with charts and high-level explanations, I go over the code instead.
elvinyung, almost 8 years ago
On this topic, I really like the Join Order Benchmark paper: http://www.vldb.org/pvldb/vol9/p204-leis.pdf

It basically shows that most cost-based optimizers are pretty bad at cardinality estimation, and that the error compounds as queries use more joins.
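
As a rough illustration of the compounding mentioned in this comment, the sketch below assumes every join misestimates its selectivity by the same multiplicative factor and that those errors multiply along a join chain. The function name and the example factor are hypothetical and are not taken from the paper.

```python
def compounded_error(per_join_error: float, num_joins: int) -> float:
    """Multiplicative cardinality error after a chain of joins.

    Illustrative assumption only: each join is off by the same factor
    and the per-join errors multiply.
    """
    if per_join_error <= 0:
        raise ValueError("per_join_error must be positive")
    if num_joins < 0:
        raise ValueError("num_joins must be non-negative")
    return per_join_error ** num_joins


if __name__ == "__main__":
    # Show how a fixed 2x error per join grows with the join count.
    for joins in (1, 2, 4, 6):
        print(joins, compounded_error(2.0, joins))
```

Under these assumptions, a 2x misestimate per join grows to a 64x misestimate after six joins, which is why plans with many joins are the most sensitive to estimation quality.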
ris, almost 8 years ago
Still catching up with postgres which added multivariate column statistics in 9.6 :)

Not that this isn't a great development in itself...
makmanalp, almost 8 years ago
What's cool about these statistics-based approaches is that you mostly don't even need fully up-to-date statistics, just overall decent stats, unless you have an insane amount of churn. Meaning - you can get query speedup without insertion overhead: you choose to take that overhead any time you want using ANALYZE.

Very neat stuff from the Databricks team!
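
The trade-off described in this comment can be sketched with a toy in-memory table. This is not Spark's or Postgres's implementation; the `Table` class, its fields, and the uniform-distribution estimate are all hypothetical and exist only to show statistics being refreshed on demand rather than on every insert.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table whose statistics are refreshed only by analyze()."""

    rows: list = field(default_factory=list)
    stat_row_count: int = 0   # snapshot taken at the last analyze()
    stat_distinct: int = 0    # snapshot taken at the last analyze()

    def insert(self, value) -> None:
        # Write path does no statistics maintenance.
        self.rows.append(value)

    def analyze(self) -> None:
        # Full scan, paid only when the caller chooses to run it.
        self.stat_row_count = len(self.rows)
        self.stat_distinct = len(set(self.rows))

    def estimate_rows_per_value(self) -> float:
        # Uniformity assumption: rows spread evenly over distinct values.
        if self.stat_distinct == 0:
            return 0.0
        return self.stat_row_count / self.stat_distinct


def demo() -> tuple:
    table = Table()
    for i in range(10_000):
        table.insert(i % 100)
    table.analyze()
    fresh = table.estimate_rows_per_value()

    # Moderate churn: 5% more rows, no statistics update.
    for i in range(500):
        table.insert(i % 100)
    stale = table.estimate_rows_per_value()

    table.analyze()
    refreshed = table.estimate_rows_per_value()
    return fresh, stale, refreshed
```

In `demo()`, the stale estimate stays at 100 rows per value while the true figure has drifted to 105, which is close enough for most planning decisions; running `analyze()` again brings the estimate back in line.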