科技回声

8 comments

zwegner, over 6 years ago

OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have six 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So for mask 0, a bit is set in the mask if its index has bit (1 << 0) set, mask 1 has the same but for bit (1 << 1), etc. The masks have a simple pattern that switches between 0 bits and 1 bits every (1 << i) bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.

These masks are then extracted with PEXT for all the non-whitespace bytes. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, without the indices of whitespace bytes.

I wasn't actually able to run this code, since I don't have an AVX-512 machine, but I'm pretty sure it should be faster. I put the code on GitHub if anyone wants to try: https://github.com/zwegner/toys/blob/master/avx512-remove-spaces/avx512vbmi.cpp

<pre><code>const uint64_t index_masks[6] = {
    0xaaaaaaaaaaaaaaaa,
    0xcccccccccccccccc,
    0xf0f0f0f0f0f0f0f0,
    0xff00ff00ff00ff00,
    0xffff0000ffff0000,
    0xffffffff00000000,
};
const __m512i index_bits[6] = {
    _mm512_set1_epi8(1),
    _mm512_set1_epi8(2),
    _mm512_set1_epi8(4),
    _mm512_set1_epi8(8),
    _mm512_set1_epi8(16),
    _mm512_set1_epi8(32),
};

// ...later, inside the loop:
mask = ~mask;  // whitespace mask becomes the mask of bytes to keep
__m512i indices = _mm512_set1_epi8(0);
for (size_t index = 0; index < 6; index++) {
    // Compress bit `index` of every kept byte's position into the low lanes.
    uint64_t m = _pext_u64(index_masks[index], mask);
    indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
}
output = _mm512_permutexvar_epi8(indices, input);</code></pre>
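For readers without AVX-512 hardware, here is a minimal scalar sketch of the same index construction, written so it runs anywhere. It is not zwegner's code: `pext_soft` and `remove_spaces_block` are hypothetical names, `pext_soft` is a software stand-in for the PEXT instruction, and the inner per-lane loops play the role of the masked add and the byte permute. It handles a single block of at most 64 bytes and treats only ' ', '\r' and '\n' as whitespace, which is an assumption about what the original loop filters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Software stand-in for PEXT: gathers the bits of `src` selected by `mask`
// into the low bits of the result, preserving their order.
static uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit of mask
        if (src & lowest) out |= bit;
        mask &= mask - 1;                      // clear that bit
    }
    return out;
}

// Hypothetical helper: compacts one block of up to 64 bytes.
static std::string remove_spaces_block(const std::string& in) {
    assert(in.size() <= 64);
    static const uint64_t index_masks[6] = {
        0xaaaaaaaaaaaaaaaaULL, 0xccccccccccccccccULL, 0xf0f0f0f0f0f0f0f0ULL,
        0xff00ff00ff00ff00ULL, 0xffff0000ffff0000ULL, 0xffffffff00000000ULL,
    };

    // Bit i of `keep` is set when byte i is not whitespace.
    uint64_t keep = 0;
    size_t kept = 0;
    for (size_t i = 0; i < in.size(); i++) {
        char c = in[i];
        if (c != ' ' && c != '\r' && c != '\n') {
            keep |= uint64_t{1} << i;
            kept++;
        }
    }

    // Build the source index of each output lane one bit at a time.
    uint8_t indices[64] = {0};
    for (int b = 0; b < 6; b++) {
        uint64_t m = pext_soft(index_masks[b], keep);
        for (int lane = 0; lane < 64; lane++) {
            if ((m >> lane) & 1) {
                indices[lane] = static_cast<uint8_t>(indices[lane] + (1 << b));
            }
        }
    }

    // Gather: output lane j takes input byte indices[j] (the permute step).
    std::string out;
    for (size_t lane = 0; lane < kept; lane++) out.push_back(in[indices[lane]]);
    return out;
}
```

Calling `remove_spaces_block("a b\r\nc  d")` keeps input positions 0, 2, 5 and 8, so the first four entries of `indices` come out as exactly those values and the gather yields "abcd".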

jchw, over 6 years ago

I was excited for AVX512 long ago, but I've since heard that if you are issuing AVX512 instructions on every core you get a forcibly lower clock rate. In practice, that suggests an AVX512 algorithm could end up slower in wall-clock time even when it needs fewer cycles. If that's the case, I wonder what kind of performance gain you'd have to hit to beat a scalar or SSE-based vectorized algorithm.

dragontamer, over 6 years ago

I've been nerd-sniped as well. I can't say I'm going to go ahead and try to solve it, but the methodology presented in the post seems suboptimal.

The best method I personally think would work is the "compaction algorithm" documented here: http://www.davidespataro.it/cuda-stream-compaction-efficient-implementation/

True, that's a CUDA implementation, but AVX512 is closely related to GPU programming. Effectively, you calculate the prefix sum of the "matches".

The paper the above code is based on is very clear on how this works: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf

Pay close attention to "figure 1" on page 2. That's the crux of the algorithm. Assuming 8-bit characters, you can generate a prefix sum in just 6 steps (each step is a constant, pre-defined byte shift + add). A prefix sum is best described by the following picture: https://en.wikipedia.org/wiki/Prefix_sum#/media/File:Hillis-Steele_Prefix_Sum.svg

Full Wikipedia page on prefix sum: https://en.wikipedia.org/wiki/Prefix_sum

Prefix sum is just 6 steps for an AVX512 register on 8-bit ints. That generates the full AVX512-space permute (i.e., if the prefix sum is 5 for an element, that means that element belongs in index #5), but AVX512 has "in lane" permutes only. I dunno how many steps you'd need to get an "in lane" permute into a "cross lane" permute... but it doesn't seem too difficult of a problem (and IIRC, I think I read a blog post about how to convert the in-lane AVX512 permutes into a cross-lane one).

I bet that the above sketch of the AVX512 algorithm can be implemented in fewer than 30 assembly instructions for the full AVX512 / 64-byte space, maybe fewer than 20. That should definitely run faster than the scalar version.

-------

EDIT: Herp-derp. It doesn't seem like VPERMB is affected by AVX lanes (!!). https://www.felixcloutier.com/x86/vpermb

So I guess you can just run VPERMB at the end on the calculated prefix sum. The end.

-------

The stream compaction algorithm is a very important 1-dimensional work-balancing paradigm in the GPU programming world. It is used to select which rays are still active in a raytracing scenario (so that all SIMD registers have something to do).
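A rough scalar sketch of the prefix-sum compaction described above, assuming a single block of at most 64 bytes. The names `prefix_sum_64` and `compact_block` are made up for illustration, and each array stands in for one 64-lane byte register. One caveat worth flagging: the prefix sum gives each kept byte its destination slot, which is a scatter, whereas VPERMB gathers (output lane i reads the source lane named by index i). This sketch scatters directly rather than working out the inverse mapping a gather would need.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hillis-Steele inclusive prefix sum over 64 byte lanes.
// Runs log2(64) = 6 steps; each step is "shift lanes up by a constant, then add".
static std::array<uint8_t, 64> prefix_sum_64(std::array<uint8_t, 64> v) {
    for (size_t shift = 1; shift < 64; shift <<= 1) {
        std::array<uint8_t, 64> shifted{};  // zero-filled, like a byte shift
        for (size_t i = shift; i < 64; i++) shifted[i] = v[i - shift];
        for (size_t i = 0; i < 64; i++) {
            v[i] = static_cast<uint8_t>(v[i] + shifted[i]);
        }
    }
    return v;
}

// Hypothetical helper: drops ' ', '\r', '\n' from a block of up to 64 bytes.
static std::string compact_block(const std::string& in) {
    assert(in.size() <= 64);

    // 1 for bytes to keep, 0 for whitespace and for unused padding lanes.
    std::array<uint8_t, 64> keep{};
    for (size_t i = 0; i < in.size(); i++) {
        char c = in[i];
        keep[i] = (c != ' ' && c != '\r' && c != '\n') ? 1 : 0;
    }

    std::array<uint8_t, 64> sums = prefix_sum_64(keep);

    // sums[63] is the total number of kept bytes (at most 64, fits in a byte).
    std::string out(sums[63], '\0');
    for (size_t i = 0; i < in.size(); i++) {
        // Inclusive sum minus one is the destination slot of a kept byte.
        if (keep[i]) out[sums[i] - 1] = in[i];
    }
    return out;
}
```

For example, `compact_block("a b\r\nc  d")` produces keep flags 1,0,1,0,0,1,0,0,1 and inclusive sums 1,1,2,2,2,3,3,3,4, so the kept bytes land in slots 0 through 3 and the result is "abcd".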

Const-me, over 6 years ago

Interesting, but his scalar code is slow. When you care about performance, it's better to implement such an algorithm so that it reads bytes one by one but moves whole blocks with memmove when switching from the write state to the skip state.

The pathological case (skipping every other character) is slightly slower, but on real data it's much faster overall.

Unfortunately I don't have AVX512 hardware, so I can't test.
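One way the run-based idea could look, as a sketch rather than Const-me's actual code: scan byte by byte, but copy each run of non-whitespace bytes with a single memmove. The function names and the choice of ' ', '\r', '\n' as the whitespace set are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

static bool is_ws(char c) { return c == ' ' || c == '\r' || c == '\n'; }

// Copies src[0..n) to dst without whitespace and returns the bytes written.
// dst may equal src (in-place use): memmove tolerates overlap, and the write
// position never runs ahead of the read position.
static size_t remove_spaces_runs(const char* src, size_t n, char* dst) {
    char* out = dst;
    size_t i = 0;
    while (i < n) {
        while (i < n && is_ws(src[i])) i++;    // skip state
        size_t start = i;
        while (i < n && !is_ws(src[i])) i++;   // write state: find end of run
        std::memmove(out, src + start, i - start);
        out += i - start;
    }
    return static_cast<size_t>(out - dst);
}

// Convenience wrapper (hypothetical) that compacts a std::string in place.
static std::string remove_spaces_runs_str(std::string s) {
    s.resize(remove_spaces_runs(s.data(), s.size(), &s[0]));
    return s;
}
```

With every other character skipped, each memmove moves a single byte, which is the pathological case mentioned above. On text with long words, most of the work shifts into memmove.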

CountHackulus, over 6 years ago

This is really neat. I wonder if there's a way to keep a "remainder" around, kind of like Bresenham's algorithm, so that you can always do aligned reads from memory.

The speedup on English text is really good, and I love the exploration into the AVX intrinsics.

lostmsu, over 6 years ago

Instead of generating the shuffle at runtime, couldn't a table be used for shuffling the lower and higher parts of the register separately, then merging the results?

Also, for uncommon patterns, the register could be split further to make the shuffle table fit into L1.

Also, I am not sure the compiler can optimize that continue statement. The non-AVX version might be improved by removing the if altogether, and replacing *dst++ = src[i] with an unconditional *dst = src[i] followed by dst += (src[i] == ' ' || src[i] == '\r' || src[i] == '\n') ? 0 : 1
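The branch-free scalar variant suggested in the last paragraph might look roughly like this. It is a sketch with made-up function names, and whether a given compiler actually emits branch-free code for the conditional increment is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Always stores the current byte, then advances the write position only
// when the byte was not whitespace. A whitespace byte left at dst[pos] is
// simply overwritten by the next store. dst must have room for n bytes and
// may equal src, since the byte is read before the store.
static size_t remove_spaces_branchless(const char* src, size_t n, char* dst) {
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        char c = src[i];
        dst[pos] = c;  // unconditional store
        pos += (c == ' ' || c == '\r' || c == '\n') ? 0 : 1;
    }
    return pos;
}

// Convenience wrapper (hypothetical) that compacts a std::string in place.
static std::string remove_spaces_branchless_str(std::string s) {
    s.resize(remove_spaces_branchless(s.data(), s.size(), &s[0]));
    return s;
}
```

The trade-off is one store per input byte regardless of content, in exchange for removing a data-dependent branch that is hard to predict on mixed text.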

AVX512 VBMI – remove spaces from text
