This is legitimately fast (7s for 100M), but I'm more interested in the impact of the disk spills when swapping the hardware underneath, because the pointer swizzling is a very page-dirty way of loading data into memory with a memory map.<p>> When a heap block is offloaded to disk, the pointers pointing into it are invalidated. When we load the block back into memory, the pointers will have changed.<p>> The machine has an Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz, which has 8 cores (up to 16 virtual threads), and 128 GB of RAM, so this time the data fits fully in memory.<p>However, the M1 + SSD is basically cheating on that trend, because that beats most of my server hardware on both memory latency & ssd. Though that fits with how people will use duckdb for local analytics.<p>But otherwise this is page-fault hell.<p>> catalog_sales table is selecting 1 column ... always ordering by cs_quantity and cs_item_sk<p>The choice of sort keys is a bit odd for a comparison like this, because that tuple has a lot of duplicates. So one of the tricks my code in Tez uses to sort faster is the gallop borrowed from Tim Sort, which skips over the identical key sections when doing the merge-sort to do fewer comparisons over all.<p>If you sorted on the primary key for catalog_sales, which is (cs_item_sk, cs_order_number), then that is actually used to store data in-order for sort-merge-joins out of disk (against catalog_returns).<p>And if you get into storage ordering optimizations, you might see a massive difference between ordering them by swapping the order of those columns - if you pull the radix out, then putting the most variable keys in the beginning to skew the bits changing to the first word of the key.