Whenever you need a lookup table of vectors, remember that instructions like _mm256_cvtepu8_epi32 or _mm256_cvtepi8_epi64 can zero-extend or sign-extend integer lanes, and they can load their source directly from memory. The OP needs a vector with int32 lanes in the [0..7] range, so the table can store 1 byte per lane and expand to uint32_t while loading with _mm256_cvtepu8_epi32. That reduces the table size by a factor of 4, at the cost of a single instruction (see the sketch after the code below).

Another thing: it's possible to produce the desired permutation vectors without RAM loads at all, with a couple of BMI2 instructions. Example in C++:

<pre><code>#include <immintrin.h>
#include <stdint.h>

// Compress int32 lanes according to the mask.
// The mask is expected to contain either 0 or UINT_MAX values.
__m256i compress_epi32( __m256i val, __m256i mask )
{
    // 4 mask bits per int32 lane, one per byte
    uint32_t i32 = (uint32_t)_mm256_movemask_epi8( mask );
    // Compress, using 4-bit numbers for the lane indices
    constexpr uint32_t identity = 0x76543210;
    i32 = _pext_u32( identity, i32 );
    // Expand the 4-bit numbers into bytes
    constexpr uint64_t expand = 0x0F0F0F0F0F0F0F0Full;
    const uint64_t i64 = _pdep_u64( (uint64_t)i32, expand );
    // Move to an SSE vector
    const __m128i vec = _mm_cvtsi64_si128( (int64_t)i64 );
    // Expand the bytes into uint32_t lanes
    const __m256i perm = _mm256_cvtepu8_epi32( vec );
    // Shuffle the elements into place
    return _mm256_permutevar8x32_epi32( val, perm );
}
</code></pre>
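To illustrate the table-shrinking idea from the first paragraph, here's a minimal sketch of what a byte-per-lane table could look like. This is an assumption about the OP's setup, not their actual code: the table name, its contents, and the use of _mm256_movemask_ps to build the index are mine.

<pre><code>#include <immintrin.h>
#include <stdint.h>

// Hypothetical table: 256 entries of 8 bytes each (2 KB)
// instead of 256 entries of 8 int32 values (8 KB).
// Each entry holds the permutation indices for one 8-bit mask pattern.
static const uint8_t permTable[ 256 ][ 8 ] = { /* precomputed indices */ };

__m256i compress_epi32_lut( __m256i val, __m256i mask )
{
    // One mask bit per int32 lane
    const uint32_t bits = (uint32_t)_mm256_movemask_ps( _mm256_castsi256_ps( mask ) );
    // Load 8 bytes from the table and zero-extend each byte into an int32 lane
    const __m128i bytes = _mm_loadl_epi64( (const __m128i*)permTable[ bits ] );
    const __m256i perm = _mm256_cvtepu8_epi32( bytes );
    return _mm256_permutevar8x32_epi32( val, perm );
}
</code></pre>

A decent compiler should fold the _mm_loadl_epi64 into the memory operand of vpmovzxbd, so the table lookup costs one load-and-extend instruction plus the permute.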
One downside: these BMI2 instructions are microcoded and rather slow on Zen 2 and earlier AMD processors. AMD fixed their performance in Zen 3.
This takes me back to some blissful days spent optimizing an integer compression scheme in AVX2. It's really a difficult instruction set to work with. The author's comment about a set of oddly shaped Legos is very apt. It often lacks instructions that cross lanes, and often lacks exactly the thing you need, so you find yourself hacking around it with bit operations and lookup tables. AVX-512 is a huge improvement, but it still hasn't caught on in a big way over seven years later.
My Rust is not strong, but in the AVX-512 solution I don't fully understand how they're incrementing the input by a whole AVX-512 word (16 x u32) by only doing input = input.offset(1); ? I'd assume that would advance their input array by only a single u32.

With the approach used here, it also looks like you'll write some garbage into output after output_end, which isn't a problem unless output was sized to exactly match the expected output and you attempted to write past the end of it via _mm512_mask_compressstoreu_epi32.
Amazing; manual optimization with algorithms like this really can boost performance dramatically. I wonder whether Intel provides AVX-512-optimized libraries for basic algorithms like sorting and searching, or for matrix multiplication? Without such libraries, and with auto-vectorization still a research topic, we have to hand-craft algorithms with intrinsics, which is time-consuming.
If you're using Julia and want something like this, you can try `LoopVectorization.vfilter`.
It lowers to LLVM intrinsics that should do the right thing given AVX-512, but IIRC performance is less than stellar without AVX-512.
May be worth taking another look at. I'm not licking this cookie; I welcome anyone else wanting to get their hands dirty.
I do like how, after all that talk of "idiomatic Rust", we still end up with what is substantially assembly code with extra steps in order to get any real speed.