Whenever you need a lookup table of vectors, remember that instructions like _mm256_cvtepu8_epi32 or _mm256_cvtepi8_epi64 can zero-extend or sign-extend integer lanes, and they can load their source directly from memory. The OP needs a vector with int32 lanes in the [0..7] range, so the table can store 1 byte per lane and expand to uint32_t while loading with _mm256_cvtepu8_epi32. That reduces the table size by a factor of 4, at the cost of a single instruction (see the sketch after the code below).

Another thing: it's possible to produce the desired permutation vectors without RAM loads at all, with a couple of BMI2 instructions. Example in C++:

<pre><code>#include <immintrin.h>
#include <stdint.h>

// Compress int32 lanes according to the mask.
// The mask is expected to contain either 0 or UINT_MAX values.
__m256i compress_epi32( __m256i val, __m256i mask )
{
    // 4 mask bits per int32 lane, one per byte
    uint32_t i32 = (uint32_t)_mm256_movemask_epi8( mask );
    // Compress, using 4-bit numbers for the lane indices
    constexpr uint32_t identity = 0x76543210;
    i32 = _pext_u32( identity, i32 );
    // Expand the 4-bit numbers into bytes
    constexpr uint64_t expand = 0x0F0F0F0F0F0F0F0Full;
    const uint64_t i64 = _pdep_u64( (uint64_t)i32, expand );
    // Move to an SSE vector
    const __m128i vec = _mm_cvtsi64_si128( (int64_t)i64 );
    // Expand the bytes into uint32_t lanes
    const __m256i perm = _mm256_cvtepu8_epi32( vec );
    // Shuffle the elements into place
    return _mm256_permutevar8x32_epi32( val, perm );
}
</code></pre>
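To illustrate the table-shrinking idea from the first paragraph, here's a minimal sketch of what a byte-per-lane table could look like. This is an assumption about the OP's setup, not their actual code: the table name, its contents, and the use of _mm256_movemask_ps to build the index are mine.

<pre><code>#include <immintrin.h>
#include <stdint.h>

// Hypothetical table: 256 entries of 8 bytes each (2 KB)
// instead of 256 entries of 8 int32 values (8 KB).
// Each entry holds the permutation indices for one 8-bit mask pattern.
static const uint8_t permTable[ 256 ][ 8 ] = { /* precomputed indices */ };

__m256i compress_epi32_lut( __m256i val, __m256i mask )
{
    // One mask bit per int32 lane
    const uint32_t bits = (uint32_t)_mm256_movemask_ps( _mm256_castsi256_ps( mask ) );
    // Load 8 bytes from the table and zero-extend each byte into an int32 lane
    const __m128i bytes = _mm_loadl_epi64( (const __m128i*)permTable[ bits ] );
    const __m256i perm = _mm256_cvtepu8_epi32( bytes );
    return _mm256_permutevar8x32_epi32( val, perm );
}
</code></pre>

A decent compiler should fold the _mm_loadl_epi64 into the memory operand of vpmovzxbd, so the table lookup costs one load-and-extend instruction plus the permute.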
One downside: these BMI2 instructions are microcoded and rather slow on Zen 2 and earlier AMD processors. AMD fixed their performance in Zen 3.
This takes me back to some blissful days spent optimizing an integer compression scheme in AVX2. It's really a difficult instruction set to work with. The author's comment about a set of oddly shaped Legos is very apt. It often lacks instructions that cross lanes, and often lacks exactly the thing you need, so you find yourself hacking around it with bit operations and lookup tables. AVX-512 is a huge improvement, but it still hasn't caught on in a big way over seven years later.
My Rust is not strong, but in the AVX-512 solution I don't fully understand how they're incrementing the input by a whole AVX-512 word (16 x u32) by only doing input = input.offset(1); ? I'd assume that would advance their input array by only a single u32.

With the approach used here, it also looks like you'll write some garbage into output after output_end, which isn't a problem unless output was sized to exactly match the expected output and you attempted to write past the end of it via _mm512_mask_compressstoreu_epi32.
Amazing; manual optimization with algorithms like this really can boost performance dramatically. I wonder whether Intel provides AVX-512-optimized libraries for basic algorithms like sorting and searching, or for matrix multiplication? Without such libraries, and with auto-vectorization still a research topic, we have to hand-craft algorithms with intrinsics, which is time-consuming.
If you're using Julia and want something like this, you can try `LoopVectorization.vfilter`.
It lowers to LLVM intrinsics that should do the right thing given AVX-512, but IIRC performance is less than stellar without AVX-512.
May be worth taking another look at. I'm not licking this cookie; I welcome anyone else wanting to get their hands dirty.
I do like how, after all that talk of "idiomatic Rust", we still end up with what is substantially assembly code with extra steps in order to get any real speed.