This is legitimately fast (7 s for 100M rows), but I'm more interested in the impact of the disk spills when swapping the hardware underneath, because the pointer swizzling is a very page-dirty way of loading data into memory through a memory map.

> When a heap block is offloaded to disk, the pointers pointing into it are invalidated. When we load the block back into memory, the pointers will have changed.

> The machine has an Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz, which has 8 cores (up to 16 virtual threads), and 128 GB of RAM, so this time the data fits fully in memory.

However, the M1 + SSD is basically cheating on that trend, because it beats most of my server hardware on both memory latency and SSD speed. Though that fits with how people will use DuckDB for local analytics.

But otherwise this is page-fault hell.

> catalog_sales table is selecting 1 column ... always ordering by cs_quantity and cs_item_sk

The choice of sort keys is a bit odd for a comparison like this, because that tuple has a lot of duplicates. One of the tricks my code in Tez uses to sort faster is the gallop borrowed from TimSort, which skips over runs of identical keys during the merge phase to do fewer comparisons overall (see the sketch below).

If you sorted on the primary key for catalog_sales, which is (cs_item_sk, cs_order_number), that ordering is actually useful for storing the data in order for sort-merge joins off disk (against catalog_returns).

And if you get into storage ordering optimizations, you might see a massive difference just from swapping the order of those two columns: if you pull the radix out, putting the most variable key first concentrates the changing bits in the first word of the key.
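For illustration, here is a minimal sketch of that idea (not the actual Tez or TimSort code, and using a plain binary search where TimSort gallops exponentially): when one run's head keeps winning, find the end of the matching stretch in one search and copy it in bulk instead of comparing element by element.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Merge two sorted runs, skipping over stretches of keys in bulk.
std::vector<int> galloping_merge(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    out.reserve(a.size() + b.size());
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] <= b[j]) {
            // Find every remaining element of `a` that sorts <= b[j] in one
            // search, then copy the whole stretch without per-element compares.
            size_t end = std::upper_bound(a.begin() + i, a.end(), b[j]) - a.begin();
            out.insert(out.end(), a.begin() + i, a.begin() + end);
            i = end;
        } else {
            size_t end = std::lower_bound(b.begin() + j, b.end(), a[i]) - b.begin();
            out.insert(out.end(), b.begin() + j, b.begin() + end);
            j = end;
        }
    }
    out.insert(out.end(), a.begin() + i, a.end());
    out.insert(out.end(), b.begin() + j, b.end());
    return out;
}

int main() {
    // Duplicate-heavy inputs: long runs of equal keys are copied in bulk.
    std::vector<int> a = {1, 1, 1, 1, 5, 5, 9};
    std::vector<int> b = {1, 1, 5, 5, 5, 7};
    for (int v : galloping_merge(a, b)) std::printf("%d ", v);
    std::printf("\n");
    return 0;
}
```

With keys as duplicate-heavy as (cs_quantity, cs_item_sk), each bulk copy replaces a long series of individual comparisons.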
Many of the techniques discussed are documented in my Julia implementation of the fastest string sorting algorithm: https://www.codementor.io/@evalparse/julia-vs-r-vs-python-string-sort-performance-an-unfinished-journey-to-optimizing-julia-s-performance-f57tf9f8s
> SQLite always opts for an external sorting strategy. It writes intermediate sorted blocks to disk even if they fit in main-memory, therefore it is much slower.

Is this only the case when you avoid using an in-memory database in SQLite [0]? It seems like SQLite pretty clearly does support fully in-memory operation.

The use case for in-memory SQLite is significantly narrower, so that may be why it was not considered in this study. But I'd still be curious how an in-memory SQLite db compares to the others trialed here.

Unless I'm getting something totally mixed up.

[0] https://www.sqlite.org/inmemorydb.html
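For reference, a minimal sketch of what "in-memory SQLite" means here (link with -lsqlite3): opening the special ":memory:" filename keeps the whole database in RAM, per the linked docs. Whether the sorter would still spill intermediate runs in that mode is exactly the open question above.

```cpp
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    // ":memory:" creates a database that lives entirely in RAM and vanishes
    // when the connection is closed.
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) {
        std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }
    char* err = nullptr;
    sqlite3_exec(db,
                 "CREATE TABLE t(x INTEGER);"
                 "INSERT INTO t VALUES (3),(1),(2);"
                 "SELECT x FROM t ORDER BY x;",
                 nullptr, nullptr, &err);
    if (err) { std::fprintf(stderr, "%s\n", err); sqlite3_free(err); }
    sqlite3_close(db);
    return 0;
}
```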
Did any of the actual benchmark queries in TPC-H or TPC-DS show a speed-up?

My intuition is that the performance of large sorts (100 million rows and up) is not that important for most analytical workloads compared to the performance of scans, group-bys, and joins. Top-N sorts (ORDER BY X LIMIT N) are much more common, but most databases use different algorithms for those (see the sketch below).
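As a minimal sketch of why ORDER BY x LIMIT n doesn't need a full sort (illustrative code, not any particular database's implementation): a bounded heap keeps only the current best n rows, so the work is O(N log n) rather than O(N log N) and memory stays O(n).

```cpp
#include <algorithm>
#include <cstdio>
#include <queue>
#include <random>
#include <vector>

// Returns the n smallest values in ascending order (ORDER BY x LIMIT n).
std::vector<int> top_n(const std::vector<int>& rows, size_t n) {
    std::priority_queue<int> heap;  // max-heap holding the n smallest seen so far
    for (int x : rows) {
        if (heap.size() < n) {
            heap.push(x);
        } else if (x < heap.top()) {
            heap.pop();   // evict the largest of the kept n
            heap.push(x);
        }
    }
    std::vector<int> out;
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
    std::reverse(out.begin(), out.end());
    return out;
}

int main() {
    std::mt19937 rng(42);
    std::vector<int> rows(1000000);
    for (int& x : rows) x = static_cast<int>(rng());
    for (int v : top_n(rows, 5)) std::printf("%d\n", v);
    return 0;
}
```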
> While std::sort is excellent algorithmically, it is still a single-threaded approach that is unable to efficiently sort by multiple columns because function call overhead would quickly dominate sorting time.

I don't understand what this is saying. Is it just using a lot of words to note that as the number of columns increases, so does the comparator's cost?
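Here's one reading of that sentence, as a sketch (hypothetical names, not DuckDB's code): with std::sort the comparator runs once per row pair, and when column types are only known at runtime it has to loop over the sort columns and go through an indirect call per column on every comparison, which is presumably the "function call overhead" the post means.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

// Compares row indices a and b on one column; returns <0, 0, or >0.
using ColumnCompare = std::function<int(size_t, size_t)>;

void sort_rows(std::vector<size_t>& row_ids, const std::vector<ColumnCompare>& keys) {
    std::sort(row_ids.begin(), row_ids.end(), [&](size_t a, size_t b) {
        for (const auto& cmp : keys) {  // one indirect call per key column,
            int c = cmp(a, b);          // per comparison, per row pair
            if (c != 0) return c < 0;
        }
        return false;
    });
}

int main() {
    std::vector<int> quantity = {3, 1, 3, 2};
    std::vector<int> item     = {7, 9, 2, 5};
    std::vector<size_t> ids   = {0, 1, 2, 3};
    sort_rows(ids, {
        [&](size_t a, size_t b) { return quantity[a] - quantity[b]; },
        [&](size_t a, size_t b) { return item[a] - item[b]; },
    });
    for (size_t r : ids) std::printf("(%d, %d)\n", quantity[r], item[r]);
    return 0;
}
```

So yes, the per-comparison cost grows with the number of key columns, and the indirection adds a constant factor on top of that for every comparison.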