I'm the co-author of one of the papers referenced in the blog post ("Fast Quicksort Implementation Using AVX Instructions"). We did write the AVX-512 code back in 2015; we just had nowhere to run it, at least publicly. The paper also very explicitly says that the lookup tables can instead be replaced by the AVX-512 compress instructions. The code for that paper is available at https://github.com/vkrasnov/avx_qsort
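For anyone curious what that replacement looks like, here's a minimal sketch (written for this comment, not the paper's code) of a single partition step over eight 64-bit keys, using compress stores instead of a permutation lookup table. Compile with e.g. -mavx512f -mpopcnt:

    #include <immintrin.h>
    #include <cstdint>

    // Writes the keys below the pivot to `lo` and the rest to `hi`;
    // returns how many landed in `lo`.
    static inline int partition8(const int64_t* src, int64_t pivot,
                                 int64_t* lo, int64_t* hi) {
        __m512i keys   = _mm512_loadu_si512(src);
        __m512i piv    = _mm512_set1_epi64(pivot);
        __mmask8 below = _mm512_cmplt_epi64_mask(keys, piv);
        // Compress stores pack only the selected lanes, contiguously.
        _mm512_mask_compressstoreu_epi64(lo, below, keys);
        _mm512_mask_compressstoreu_epi64(hi, (__mmask8)~below, keys);
        return _mm_popcnt_u32(below);
    }

Without AVX-512 you'd do the same step with a table of permutation indices, which is exactly the trade-off the paper discusses.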
I wonder how fast it is compared to djbsort (https://github.com/jart/cosmopolitan/blob/master/libc/nexgen32e/djbsort-avx2.S) and longsort (https://github.com/jart/cosmopolitan/blob/e011973593407f576d6e613222a18e04cf19b483/libc/str/longsort.c#L66). djbsort is outrageously fast for 32-bit ints with AVX2 (which, unlike AVX-512, is the AVX we can reliably use in open source). But there's never been a clear choice of instruction set on Intel/AMD for sorting 64-bit ints that's reliably fast. So if this thing can actually sort 64-bit integers 10x faster on AVX2, I'd be so thrilled.
Looks like djb is giving it a spin: https://twitter.com/hashbreaker/status/1533201734369103872
I am always amazed at the algorithms re-implemented using SIMD. One of my favorites is the Striped Smith-Waterman approach used for sequence alignment. Does anyone have any good resources on learning to use SIMD? I've found it hard to take the plunge.
SIMD-based sorting isn't new, e.g. djbsort: https://sorting.cr.yp.to/. I find it strange the paper doesn't cite it or compare to it at all.

EDIT: to be fair, it does seem that Bramas' first published work on SIMD sorting (that I could find) predates djbsort: https://arxiv.org/abs/1704.08579. A comparison would still be nice, though.
This is great, and can definitely improve sort performance in database and big data projects. I can immediately imagine it being a perfect match for my project, the columnar processing engine TiFlash (https://github.com/pingcap/tiflash).
Not having played with SIMD much myself: does leveraging these instructions for an intensive operation like a sort push other workloads out of the CPU more aggressively than operating on 32 or 64 bits at a time would?

In other words, do you have to be more careful when integrating these wide operations to preserve some resources for other work?
Man, I was thinking about this problem and came up with some weird solutions.

I'd actually thought all this stuff had already been solved. Is there still a point in trying to improve the efficiency of sorting?

I can do that! I'd love such a challenge if there were a point to it!
Re-implementation of stuff with SIMD is always amazing to me. I have done stuff with SIMD before: 4-element float vector operations, basic arithmetic on arrays of floats.

Those things are super simple once you get past the scary mystique of SIMD.

It's stuff like this that should be getting all the witch burning :D
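To show what I mean by the simple end, here's a toy example of my own (assumes n is a multiple of 8; compile with -mavx) that adds two float arrays eight lanes at a time:

    #include <immintrin.h>

    void add_arrays(const float* a, const float* b, float* out, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats from each input
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds in one op
        }
    }

The vectorized sorting networks in the article are a different beast entirely.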
> By comparison, the standard library reaches 58/128/117 MB/s on the same CPU, so we have managed a 9-19x speedup depending on the type of numbers.

That's pretty disingenuous. The standard implementation is known to be terribly slow because of the constraints it operates under (it has to handle arbitrary types and comparators). They should compare against the current state of the art.
Quote: "Today we're sharing open source code that can sort arrays of numbers about ten times as fast as the C++ std::sort...." and proceeds to explain about SIMD instructions.<p>Well, to me sounds that C++ std::sort is simply lacking an optimization which can very easy be implemented in next iteration of the compiler/linker. I had high hopes this would be a really breakthrough algorithm, not lack of a simple optimization using some instruction set that compiler/linker didn't implement. If they want, CPU vendors would make an even more awesome and faster instruction set, let's say SIMD2, which would leave this one in the dust.