TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Fastest table sort in the West – Redesigning DuckDB's sort

129 点作者 hfmuehleisen超过 3 年前

6 条评论

gopalv超过 3 年前
This is legitimately fast (7s for 100M), but I&#x27;m more interested in the impact of the disk spills when swapping the hardware underneath, because the pointer swizzling is a very page-dirty way of loading data into memory with a memory map.<p>&gt; When a heap block is offloaded to disk, the pointers pointing into it are invalidated. When we load the block back into memory, the pointers will have changed.<p>&gt; The machine has an Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz, which has 8 cores (up to 16 virtual threads), and 128 GB of RAM, so this time the data fits fully in memory.<p>However, the M1 + SSD is basically cheating on that trend, because that beats most of my server hardware on both memory latency &amp; ssd. Though that fits with how people will use duckdb for local analytics.<p>But otherwise this is page-fault hell.<p>&gt; catalog_sales table is selecting 1 column ... always ordering by cs_quantity and cs_item_sk<p>The choice of sort keys is a bit odd for a comparison like this, because that tuple has a lot of duplicates. So one of the tricks my code in Tez uses to sort faster is the gallop borrowed from Tim Sort, which skips over the identical key sections when doing the merge-sort to do fewer comparisons over all.<p>If you sorted on the primary key for catalog_sales, which is (cs_item_sk, cs_order_number), then that is actually used to store data in-order for sort-merge-joins out of disk (against catalog_returns).<p>And if you get into storage ordering optimizations, you might see a massive difference between ordering them by swapping the order of those columns - if you pull the radix out, then putting the most variable keys in the beginning to skew the bits changing to the first word of the key.
评论 #28330304 未加载
xiaodai超过 3 年前
Many of the techniques discussed are documented in my Julia implementation of the fastest string sorting algorithm <a href="https:&#x2F;&#x2F;www.codementor.io&#x2F;@evalparse&#x2F;julia-vs-r-vs-python-string-sort-performance-an-unfinished-journey-to-optimizing-julia-s-performance-f57tf9f8s" rel="nofollow">https:&#x2F;&#x2F;www.codementor.io&#x2F;@evalparse&#x2F;julia-vs-r-vs-python-st...</a>
legg0myegg0超过 3 年前
This is so fast!! If anybody is using Pandas to keep rows in order and has hesitated to use DuckDB for that reason, hesitate no more! Give it a shot!
评论 #28329777 未加载
eatonphil超过 3 年前
&gt; SQLite always opts for an external sorting strategy. It writes intermediate sorted blocks to disk even if they fit in main-memory, therefore it is much slower.<p>Is this just when avoiding using in-memory databases in SQLite [0]? It seems like SQLite pretty clearly does support fully in-memory operations.<p>The use-case for in-memory SQLite is significantly narrower so that may be why it was not considered in this study. But I&#x27;d still be curious how an in-memory SQLite db compares to the others trialed here.<p>Unless I&#x27;m getting something totally mixed up.<p>[0] <a href="https:&#x2F;&#x2F;www.sqlite.org&#x2F;inmemorydb.html" rel="nofollow">https:&#x2F;&#x2F;www.sqlite.org&#x2F;inmemorydb.html</a>
评论 #28330207 未加载
AdamProut超过 3 年前
Did any of the actual benchmark queries in TPC-H or TPC-DS show a speed up?<p>My intuition is that the performance of large sorts (100 millions of rows) are not that important for most analytical workloads compared to the performance of doing scans, group-bys and joins. Top-sorts (ORDER BY X LIMIT N) are much more popular, but most databases use different algorithms for those.
评论 #28336058 未加载
masklinn超过 3 年前
&gt; While std::sort is excellent algorithmically, it is still a single-threaded approach that is unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time.<p>I don&#x27;t understand what this is saying. Is it just using a lot of words to note that as the number of columns increase so do the comparator&#x27;s cost?
评论 #28331071 未加载