Filtering a vector with SIMD instructions (AVX-2 and AVX-512)

104 points by francoismassot over 2 years ago

9 comments

Const-me over 2 years ago
Whenever you need a lookup table with vectors, remember that instructions like _mm256_cvtepu8_epi32 or _mm256_cvtepi8_epi64 can zero-extend or sign-extend integer lanes, and they can load the source directly from memory. The OP needs a vector with int32 lanes in the [0..7] range, so the table can use 1 byte per lane and expand to uint32_t while loading with _mm256_cvtepu8_epi32. That reduces the table size by a factor of 4, at the cost of a single instruction.

Another thing: it's possible to produce the desired vectors without RAM loads, with a couple of BMI2 instructions. Example in C++:

    #include <immintrin.h>
    #include <cstdint>

    // Compress int32 lanes according to the mask
    // The mask is expected to contain either 0 or UINT_MAX values
    __m256i compress_epi32( __m256i val, __m256i mask )
    {
        uint32_t i32 = (uint32_t)_mm256_movemask_epi8( mask );
        // Compress, using 4-bit numbers for lane indices
        constexpr uint32_t identity = 0x76543210;
        i32 = _pext_u32( identity, i32 );
        // Expand the 4-bit numbers into bytes
        constexpr uint64_t expand = 0x0F0F0F0F0F0F0F0Full;
        uint64_t i64 = _pdep_u64( (uint64_t)i32, expand );
        // Move to an SSE vector
        __m128i vec = _mm_cvtsi64_si128( (int64_t)i64 );
        // Expand the bytes to uint32_t lane indices
        __m256i perm = _mm256_cvtepu8_epi32( vec );
        // Shuffle the elements into place
        return _mm256_permutevar8x32_epi32( val, perm );
    }

One downside: these BMI2 instructions are rather slow on Zen 2 and earlier AMD processors. AMD fixed the performance in Zen 3.
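For anyone who wants to try the byte-table idea from the first paragraph of the comment above, here is a minimal sketch in Rust (the article's language); the intrinsics carry the same names in std::arch::x86_64, while the table layout and the names BYTE_TABLE, build_table, and load_permutation are invented for this illustration, not taken from the post:

    // Sketch only, assuming an x86_64 target with AVX2.
    // BYTE_TABLE and load_permutation are made-up names; only the
    // intrinsics come from the comment above.
    use std::arch::x86_64::*;

    // For each 8-bit mask, list the indices of its set bits as bytes,
    // zero-padded: 8 bytes per entry instead of 8 x u32 (2 KiB vs 8 KiB).
    const fn build_table() -> [[u8; 8]; 256] {
        let mut t = [[0u8; 8]; 256];
        let mut m = 0;
        while m < 256 {
            let mut src = 0;
            let mut dst = 0;
            while src < 8 {
                if m & (1 << src) != 0 {
                    t[m][dst] = src as u8;
                    dst += 1;
                }
                src += 1;
            }
            m += 1;
        }
        t
    }
    static BYTE_TABLE: [[u8; 8]; 256] = build_table();

    #[target_feature(enable = "avx2")]
    unsafe fn load_permutation(mask8: u8) -> __m256i {
        // Load 8 table bytes and zero-extend each one to a u32 lane as
        // part of the load, which is the trick described above.
        let row = BYTE_TABLE[mask8 as usize].as_ptr() as *const __m128i;
        _mm256_cvtepu8_epi32(_mm_loadl_epi64(row))
    }

The resulting permutation vector then feeds _mm256_permutevar8x32_epi32 exactly as in the C++ snippet.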
slashdev over 2 years ago
This takes me back to some blissful days spent optimizing an integer compression scheme in AVX2. It's really a difficult instruction set to work with. The author's comment about a set of oddly shaped Legos is very apt: it often doesn't have instructions that cross lanes and often lacks the thing you need, so you find yourself hacking around it with bit operations and lookup tables. AVX-512 is a huge improvement, but it still hasn't caught on in a big way over seven years later.
Comment #32681426 not loaded
aqrit over 2 years ago
The AVX2 version is using an 8 KiB table...? Even the range check is inefficient. I'd bet that the AVX2 version could be 50% faster.
Comment #32682349 not loaded
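The range check aqrit is criticizing isn't quoted in the thread, so purely as an illustration of the kind of branchless lane-wise check that typically sits in front of the compress step, here is a generic AVX2 sketch in Rust (in_range_mask and its parameters are invented for this example, not taken from the post):

    // Generic illustration only, assuming an x86_64 target with AVX2.
    // Produces an all-ones/all-zeros mask per i32 lane for lo <= v <= hi.
    use std::arch::x86_64::*;

    #[target_feature(enable = "avx2")]
    unsafe fn in_range_mask(values: __m256i, lo: i32, hi: i32) -> __m256i {
        let below = _mm256_cmpgt_epi32(_mm256_set1_epi32(lo), values); // v < lo
        let above = _mm256_cmpgt_epi32(values, _mm256_set1_epi32(hi)); // v > hi
        let outside = _mm256_or_si256(below, above);
        // Invert so that in-range lanes become all ones.
        _mm256_xor_si256(outside, _mm256_set1_epi32(-1))
    }

A mask in this form is exactly what the compress_epi32 snippet in Const-me's comment consumes.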
stu2010 over 2 years ago
My Rust is not strong, but in the AVX-512 solution I don't fully understand how they're incrementing the input by a whole AVX-512 word (16 x u32) by only doing input = input.offset(1); ? I'd assume that will increment their input array by only a single u32.

With the approach used here, it also looks like you'll write some garbage into output after output_end, which isn't a problem unless output was sized exactly to the expected output and you attempted to write past the end of it via _mm512_mask_compressstoreu_epi32.
Comment #32682194 not loaded
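On stu2010's first question: if input is declared as a *const __m512i rather than a *const u32, Rust's pointer offset already does the right thing, because offset counts in units of the pointee type, not bytes. A tiny self-contained check, using a 64-byte [u32; 16] array as a stand-in for __m512i (the post's actual declaration isn't shown in the comment):

    // Illustration: pointer::offset(1) advances by size_of::<T>() bytes,
    // so for a 64-byte pointee it steps a whole AVX-512 word at a time.
    use std::mem::size_of;

    fn main() {
        let data = [0u32; 32];
        let p = data.as_ptr() as *const [u32; 16]; // stand-in for __m512i
        let p0 = p as usize;
        let p1 = unsafe { p.offset(1) } as usize;
        assert_eq!(p1 - p0, size_of::<[u32; 16]>()); // 64 bytes, not 4
    }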
synergy20 over 2 years ago
Amazing. Manual optimization with algorithms like this really can boost performance dramatically. I wonder whether Intel provides AVX-512 optimized libraries for basic algorithms like sorting and searching, or for matrix multiplication? Without those libraries, and with auto-vectorization still a research topic, we have to hand-craft algorithms with intrinsics, which is time consuming.
Comment #32680867 not loaded

Comment #32680340 not loaded

Comment #32680689 not loaded
celrod over 2 years ago
If you're using Julia and want something like this, you can try `LoopVectorization.vfilter`. It lowers to LLVM intrinsics that should do the right thing given AVX-512, but performance is less than stellar without it, IIRC. May be worth taking another look at. I'm not licking this cookie; I welcome anyone else wanting to get their hands dirty.
dmitrygr over 2 years ago
I do like how, after all that talk of "idiomatic rust", we still end up with what is substantially assembly code with extra steps to get any real speed.
Comment #32681071 not loaded
kklisura over 2 years ago
Are any of these instruction sets available on M1? If not, what's the alternative there?
Comment #32681772 not loaded

Comment #32683823 not loaded
hoten over 2 years ago
This page crashes mobile Chrome (Android), anyone else see that?
Comment #32679874 not loaded

Comment #32680829 not loaded

Comment #32687836 not loaded