
JIT-Compiling SQL Queries in PostgreSQL Using LLVM

128 points, by croh, about 4 years ago

9 comments

murkt, about 4 years ago
It's enabled by default since Postgres 12, and in some cases it can be harmful; I mentioned it in my blog post about JSONB indexes [1].

I found it hard to make Postgres reuse JIT-compiled plans, and quite often JIT compilation kicks in when the planner thinks there will be many results, then spends more time compiling than executing the non-compiled query would have taken. Postgres needs a setting to raise the JIT compilation threshold based on query plan complexity.

[1] https://vsevolod.net/postgresql-jsonb-index/#jit-compiling-bad-plans
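For readers hitting the same problem, the thresholds murkt refers to are ordinary PostgreSQL settings and can be changed per session (or per role / cluster-wide); a minimal sketch, with illustrative values only:

    -- Illustrative values: the right thresholds depend on your workload.
    -- Disable JIT entirely for the current session:
    SET jit = off;

    -- Or keep JIT but raise the cost thresholds so it only fires on very
    -- expensive plans (defaults are 100000 / 500000 / 500000):
    SET jit_above_cost          = 1000000;
    SET jit_inline_above_cost   = 5000000;
    SET jit_optimize_above_cost = 5000000;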
derefr, about 4 years ago
I know RDBMSes are supposed to hide the "raw" procedural query plan underneath a declarative "what data do you want" abstraction layer... but what if I know exactly the query plan I want? Why can't I upload a "prepared query plan" and execute *exactly* that query plan?

In most RDBMSes, the problem with that would be that there's no existing formal intermediate representation of a query plan to use on the wire.

But in PG's case, that'd be LLVM IR, no?

I haven't looked too deeply into what Postgres exposes via its extension APIs, but I feel like it wouldn't be impossible to write a PG extension that would let you call a stored procedure with a bytea containing the LLVM IR for a query plan, and get back a handle to a "prepared statement that's a little bit more" — where EXECUTEing that statement would just reuse the cached query plan attached to it, as if PG had previously JITed a plan for that statement.

If I built something like this, would anyone want to use it? It'd have the disadvantage of you having to figure out how to generate LLVM IR on the client side representing your desired query plan, and of that query plan potentially getting out of sync with the DB. But with stable DB schemas and data sizes (e.g. in data warehousing) it seems like a win.
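To make the idea concrete, here is a purely hypothetical sketch of what such an extension's SQL surface could look like. No such extension or functions exist; pg_plan_upload and pg_plan_execute are invented names for the interface described above:

    -- Hypothetical interface, not a real extension: both functions below
    -- are invented for illustration.
    SELECT pg_plan_upload(
             'warehouse_rollup',                           -- handle name
             pg_read_binary_file('/tmp/rollup_plan.bc')    -- client-generated LLVM IR
           );

    -- Executing the handle would run exactly that cached, pre-compiled plan:
    SELECT * FROM pg_plan_execute('warehouse_rollup');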
darksaints, about 4 years ago
A couple of things to point out: the JIT is very dependent on the Postgres cost model, which is still pretty primitive. For example, the cost model is not aware of TOAST sizes, which can drastically impact function execution costs if that function is size dependent (very common with things like geospatial queries). Another problem is the lack of join statistics...you can have perfectly recorded statics for both tables, but the combination of tables or columns used in the join, or possibly the expression used for the join, can cause cost estimate chaos.<p>I know JIT plan caching is a hard problem, but it would be very helpful. But I actually think JIT is the wrong solution to this problem. Instead of recompiling an optimized plan every time, we should actually be using PGO. We already have a profiling mechanism built in: Explain Analyze. All that is needed is for that profiling mechanism to link up with the optimizer, giving it all the information it needs to do all of the same optimizations that we get with JIT today, but combined with extremely powerful global optimizations that no JIT would ever even attempt. And we already have the perfect use cases too: every view or materialized view is already based on a query that will be used regularly, and is worth profiling and optimizing.
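For readers who want to see how much of a query's runtime the JIT actually accounts for, EXPLAIN (ANALYZE) on PostgreSQL 11+ prints a JIT section whenever JIT fires. A minimal example; the payments table and all timings are illustrative, and the output is abridged:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT sum(amount)
    FROM payments
    WHERE created_at > now() - interval '30 days';

    --  ...
    --  Planning Time: 0.3 ms
    --  JIT:
    --    Functions: 6
    --    Options: Inlining false, Optimization false, Expressions true, Deforming true
    --    Timing: Generation 1.2 ms, Inlining 0.0 ms, Optimization 0.7 ms,
    --            Emission 9.4 ms, Total 11.3 ms
    --  Execution Time: 45.8 ms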
asguy, about 4 years ago
This is a good read, but it probably should be marked as coming from 2017.
megous, about 4 years ago
Seeing that PostgreSQL has pluggable JIT backends, maybe using MIR would be a faster and more lightweight alternative to LLVM.

https://github.com/vnmakarov/mir
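The pluggability mentioned here is exposed through the jit_provider setting (default 'llvmjit'), which can only be changed at server start. A hypothetical MIR-based provider, if someone wrote one, might be selected roughly like this; 'mirjit' is an invented name:

    -- jit_provider is a real, server-start-only setting; 'mirjit' is a
    -- hypothetical provider name for a MIR-based backend.
    ALTER SYSTEM SET jit_provider = 'mirjit';
    -- Requires a server restart, since the provider library is loaded at startup.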
ltbarcly3, about 4 years ago
In my experience tuning Postgres, the JIT has rarely been a net win. It is very reliant on Postgres' cost estimate, which becomes a problem: the more complex the query, the more likely the JIT is to help, but, in general, the worse the cost estimate becomes. This often leads to cases where the cost estimate triggers a full JIT'ing of the query, and the time to JIT and optimize is longer than the entire runtime of the query with the JIT turned off.

I thought maybe this was because the default operation costs had not been updated for a long time (they were set when there was no such thing as an SSD, etc.), so I tried to optimize them. I wrote a script that used a scipy global optimizer (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html) to run a set of real-life queries against a copy of my database, with the goal of finding a PostgreSQL cost-estimate configuration that minimized total query time.

This optimization took over a day to run, and it would have taken much longer except that I wrote a function to diff query plans and cache the runtimes of each query for identical query plans.

Here is what it ended up with:

    cpu_index_tuple_cost         = 0.05935355125176216
    cpu_operator_cost            = 0.012593074414763324
    cpu_tuple_cost               = 0.15786073958355235
    random_page_cost             = 0.4529142183444561
    seq_page_cost                = 1.163448763641469
    parallel_tuple_cost          = 0.18328552089499073
    parallel_setup_cost          = 30630.595453975257
    join_collapse_limit          = 29.62367337441951
    from_collapse_limit          = 18.30069861231032
    geqo_threshold               = 20.35752330027858
    geqo_effort                  = 1.9431285958606317
    min_parallel_table_scan_size = 42.64686041069217
    min_parallel_index_scan_size = 897.8594030984586
    jit_above_cost               = 1060139.219983833
    jit_inline_above_cost        = 5674002.74709
    jit_optimize_above_cost      = 4974481.895365894

Doing this global optimization was made more difficult by the fact that most, but not all, of the costs are only meaningful in relative terms. For example, setting random_page_cost to 3x seq_page_cost is just a relative cost; if you scale all of these costs by some constant you get the same query plan. However, the introduction of the *_size parameters anchors all the other parameters, so the search space goes from an N-dimensional unit sphere to an entire N-dimensional space. While the unit sphere is only one dimension less, it is also a much smaller space in terms of boundaries, since each variable can be limited to the range (0, pi) without imposing artificial absolute numerical guesses (which could easily exclude the optimal configuration from the searched space) to make the search space small enough that a global optimization is tractable.

(Things like join_collapse_limit aren't scale-free, but they refer only to the query and not to the data, so they are independent, and besides, they have a very easy-to-guess range of acceptable values.)

Anyway, that is a long way of saying that the above did have a measurable effect on performance for my use case, but only on average; some queries still ended up slower than with the default settings.
Looking into these, they were for the most part cases where the row-count estimates the planner was generating were wrong by orders of magnitude, and this incorrect row count, composed with the incorrect cost configuration, had caused the planner to pick a good query plan by luck.

To sum up, after analyzing the performance of hundreds and hundreds of queries, I have yet to see the JIT make one significantly faster, and I very often see much, much worse performance. When I looked into what the JIT was doing, it appeared to be compiling hundreds and hundreds of functions for somewhat straightforward queries, which looked like a bug, but I didn't have time to investigate further. I ended up just cranking the JIT thresholds up to very, very large numbers, under the theory that I want the JIT basically turned off, but if a query has an astronomical cost estimate, the time taken to JIT it is probably negligible compared to the total runtime, so it probably won't hurt.

Meanwhile, if you have a database server with spare CPU (and I/O, if your database doesn't fit in memory) capacity during peak times, I would look at tuning the parallel query parameters to encourage more parallelization. So far in my testing this can give you very close to a 1/K query time for a query that uses K workers. It is fantastic.
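For anyone wanting to try the same "JIT effectively off, except for astronomically expensive plans, plus more parallelism" approach cluster-wide, the settings involved can be changed with ALTER SYSTEM. A minimal sketch with illustrative values, not the tuned numbers from the comment above:

    -- Illustrative values, not recommendations: tune against your own workload.
    -- Push the JIT thresholds high enough that only very expensive plans compile:
    ALTER SYSTEM SET jit_above_cost          = 10000000;
    ALTER SYSTEM SET jit_inline_above_cost   = 50000000;
    ALTER SYSTEM SET jit_optimize_above_cost = 50000000;

    -- Encourage more parallel workers if there is spare CPU at peak:
    ALTER SYSTEM SET max_parallel_workers_per_gather = 8;     -- default 2
    ALTER SYSTEM SET parallel_setup_cost             = 500;   -- default 1000
    ALTER SYSTEM SET parallel_tuple_cost             = 0.05;  -- default 0.1

    SELECT pg_reload_conf();  -- apply without a restart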
rrdharan, about 4 years ago
See also: https://www.citusdata.com/blog/2018/09/11/postgresql-11-just-in-time/
CalChris, about 4 years ago
The talk is available: https://www.youtube.com/watch?v=ZNTeVt8V6PU
runeks, about 4 years ago
This sounds like a cool idea, but I don't think it's very usable in practice.

At least not when using SSD/HDD for storage, where — in my experience — CPU time never exceeds 50% of the time spent executing a query. This means that the *maximum* speedup is 2x, without even accounting for the time spent performing JIT compilation.

So we have a situation where the maximum theoretical speedup is 2x and the maximum slowdown is unbounded.
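The 2x ceiling is just Amdahl's law applied to the CPU-bound fraction of query time. A short worked check, assuming (as the comment does) that at most half the runtime is CPU:

    % Amdahl's law: p = fraction of runtime the JIT can speed up,
    % s = speedup of that fraction.
    \[
      S(s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
      p = 0.5,\ s \to \infty \;\Longrightarrow\; S = \frac{1}{0.5} = 2
    \]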