I'm the co-author of one of the papers referenced in the blog post ("Fast Quicksort Implementation Using AVX Instructions"). We did write the AVX-512 code back in 2015; we just had nowhere to run it, at least publicly. The paper also very explicitly says that the lookup tables can instead be replaced by the AVX-512 compress instructions. The code for that paper is available at https://github.com/vkrasnov/avx_qsort
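For anyone curious what that replacement looks like, here's a minimal sketch (written for this comment, not the paper's code) of a single partition step over eight 64-bit keys, using compress stores instead of a permutation lookup table. Compile with e.g. -mavx512f -mpopcnt:

    #include <immintrin.h>
    #include <cstdint>

    // Writes the keys below the pivot to `lo` and the rest to `hi`;
    // returns how many landed in `lo`.
    static inline int partition8(const int64_t* src, int64_t pivot,
                                 int64_t* lo, int64_t* hi) {
        __m512i keys   = _mm512_loadu_si512(src);
        __m512i piv    = _mm512_set1_epi64(pivot);
        __mmask8 below = _mm512_cmplt_epi64_mask(keys, piv);
        // Compress stores pack only the selected lanes, contiguously.
        _mm512_mask_compressstoreu_epi64(lo, below, keys);
        _mm512_mask_compressstoreu_epi64(hi, (__mmask8)~below, keys);
        return _mm_popcnt_u32(below);
    }

Without AVX-512 you'd do the same step with a table of permutation indices, which is exactly the trade-off the paper discusses.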
I wonder how fast it is compared to djbsort (https://github.com/jart/cosmopolitan/blob/master/libc/nexgen32e/djbsort-avx2.S) and longsort (https://github.com/jart/cosmopolitan/blob/e011973593407f576d6e613222a18e04cf19b483/libc/str/longsort.c#L66). djbsort is outrageously fast for 32-bit ints with AVX2 (which, unlike AVX-512, is the AVX we can reliably use in open source). But there's never been a clear choice of instruction set on Intel/AMD for sorting 64-bit ints that's reliably fast. So if this thing can actually sort 64-bit integers 10x faster on AVX2, I'd be so thrilled.
Looks like djb is giving it a spin: https://twitter.com/hashbreaker/status/1533201734369103872
I am always amazed at the algorithms re-implemented using SIMD. One of my favorites is the Striped Smith-Waterman approach used for sequence alignment. Does anyone have any good resources on learning to use SIMD? I've found it hard to take the plunge.
SIMD-based sorting isn't new, e.g. djbsort: https://sorting.cr.yp.to/. I find it strange the paper doesn't cite it or compare to it at all.

EDIT: to be fair, it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort: https://arxiv.org/abs/1704.08579. A comparison would still be nice, though.
This is great, and can definitely improve sort performance in database and big data projects. I can immediately imagine it being a perfect match for my project, the columnar processing engine TiFlash (https://github.com/pingcap/tiflash).
Not having played with SIMD much myself: does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?

In other words, do you have to be more careful when integrating these wide operations to preserve some resources for other work?
Man, I was thinking about this problem and came up with some weird solutions.

I'd actually thought all this stuff had already been solved. Is there still a point in trying to improve the efficiency of sorting?

I can do that! I'd love such a challenge if there were a point to it!
Re-implementation of stuff with SIMD is always amazing to me. I have done stuff with SIMD before: 4-element float vector operations, basic arithmetic on arrays of floats.

Those things are super simple once you get past the scary mystique of SIMD.

It's stuff like this that should be getting all the witch burning :D
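To show what I mean by the simple end, here's a toy example of my own (assumes n is a multiple of 8; compile with -mavx) that adds two float arrays eight lanes at a time:

    #include <immintrin.h>

    void add_arrays(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from each input
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds in one op
        }
    }

The vectorized sorting networks in the article are a different beast entirely.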
> By comparison, the standard library reaches 58/128/117 MB/s on the same CPU, so we have managed a 9-19x speedup depending on the type of numbers.

That's pretty disingenuous. The standard implementation is known to be terribly slow because of the constraints it operates under (it has to handle arbitrary types and comparators). They should compare against the current state of the art.
Quote: "Today we're sharing open source code that can sort arrays of numbers about ten times as fast as the C++ std::sort...." and proceeds to explain about SIMD instructions.<p>Well, to me sounds that C++ std::sort is simply lacking an optimization which can very easy be implemented in next iteration of the compiler/linker. I had high hopes this would be a really breakthrough algorithm, not lack of a simple optimization using some instruction set that compiler/linker didn't implement. If they want, CPU vendors would make an even more awesome and faster instruction set, let's say SIMD2, which would leave this one in the dust.