TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

AVX512 VBMI – remove spaces from text

152 点作者 pedro84超过 6 年前

8 条评论

zwegner超过 6 年前
OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have 6 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So for mask 0, a bit is set in the mask if its index has bit (1 &lt;&lt; 0) set, mask 1 has the same but for bit (1 &lt;&lt; 1), etc. The masks have a simple pattern, that changes between 1 and 0 bits every (1 &lt;&lt; i) bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.<p>These masks are then extracted with PEXT for all the non-whitespace bytes. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, without the indices of whitespace bytes.<p>I wasn&#x27;t actually able to run this code, since I don&#x27;t have an AVX-512 machine, but I&#x27;m pretty sure it should be faster. I put the code on github if anyone wants to try: <a href="https:&#x2F;&#x2F;github.com&#x2F;zwegner&#x2F;toys&#x2F;blob&#x2F;master&#x2F;avx512-remove-spaces&#x2F;avx512vbmi.cpp" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;zwegner&#x2F;toys&#x2F;blob&#x2F;master&#x2F;avx512-remove-sp...</a><p><pre><code> const uint64_t index_masks[6] = { 0xaaaaaaaaaaaaaaaa, 0xcccccccccccccccc, 0xf0f0f0f0f0f0f0f0, 0xff00ff00ff00ff00, 0xffff0000ffff0000, 0xffffffff00000000, }; const __m512i index_bits[6] = { _mm512_set1_epi8(1), _mm512_set1_epi8(2), _mm512_set1_epi8(4), _mm512_set1_epi8(8), _mm512_set1_epi8(16), _mm512_set1_epi8(32), }; ...later, inside the loop: mask = ~mask; __m512i indices = _mm512_set1_epi8(0); for (size_t index = 0; index &lt; 6; index++) { uint64_t m = _pext_u64(index_masks[index], mask); indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]); } output = _mm512_permutexvar_epi8(indices, input);</code></pre>
评论 #18836655 未加载
评论 #18838612 未加载
评论 #18836966 未加载
jchw超过 6 年前
I was excited for AVX512 long ago but I&#x27;ve since heard that if you are jamming AVX512 instructions to every core you get a forcibly lower clockrate. In practice this sounds like it&#x27;d suggest using an AVX512 algorithm could actually be slower even when it&#x27;s faster. If that&#x27;s the case, I wonder what kind of performance gain you&#x27;d have to hit to beat a scalar or SSE-based vectorized algorithm.
评论 #18836329 未加载
评论 #18835911 未加载
评论 #18835775 未加载
评论 #18835797 未加载
评论 #18836325 未加载
dragontamer超过 6 年前
I&#x27;ve been nerdsniped as well. I can&#x27;t say I&#x27;m going to go ahead and try and solve it, but the methodology presented in the post seems suboptimal.<p>The best method I personally think would work, is the &quot;compaction algorithm&quot; documented here: <a href="http:&#x2F;&#x2F;www.davidespataro.it&#x2F;cuda-stream-compaction-efficient-implementation&#x2F;" rel="nofollow">http:&#x2F;&#x2F;www.davidespataro.it&#x2F;cuda-stream-compaction-efficient...</a><p>True, that&#x27;s a CUDA implementation, but AVX512 is closely related to GPU programmers. Effectively, you calculate the prefix sum of the &quot;matches&quot;.<p>The paper the above code is based on is very clear on how this works: <a href="http:&#x2F;&#x2F;www.cse.chalmers.se&#x2F;~uffe&#x2F;streamcompaction.pdf" rel="nofollow">http:&#x2F;&#x2F;www.cse.chalmers.se&#x2F;~uffe&#x2F;streamcompaction.pdf</a><p>Pay close attention to &quot;figure 1&quot; on page 2. That&#x27;s the crux of the algorithm. Assuming 8-bit characters, you can generate a prefix-sum in just 6-steps (Each step is a constant, pre-defined byte-shift + Add). A prefix sum is best described by the following picture: <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prefix_sum#&#x2F;media&#x2F;File:Hillis-Steele_Prefix_Sum.svg" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prefix_sum#&#x2F;media&#x2F;File:Hillis-...</a><p>Full Wikipedia page on Prefix Sum: <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prefix_sum" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prefix_sum</a><p>Prefix Sum is just 6-steps for a AVX512 register on 8-bit ints. That generates the full AVX512-space permute (ie: if the prefix sum is 5 for an element, that means that element belongs in index #5)., but AVX512 has &quot;in lane&quot; permutes only. I dunno how many steps you&#x27;d need to get a &quot;in lane&quot; permute into a &quot;cross lane&quot; permute... but it doesn&#x27;t seem too difficult of a problem (and IIRC, I think i read a blogpost about how to convert the in-lane AVX512 permutes into a cross-lane one).<p>I bet that the above sketch of the AVX512 algorithm can be implemented in less than 30 assembly instructions for the full AVX512 &#x2F; 64-byte space, maybe less than 20. That should definitely run faster than the scalar version.<p>-------<p>EDIT: Herp-derp. It doesn&#x27;t seem like VPERMB is affected by AVX Lanes (!!). <a href="https:&#x2F;&#x2F;www.felixcloutier.com&#x2F;x86&#x2F;vpermb" rel="nofollow">https:&#x2F;&#x2F;www.felixcloutier.com&#x2F;x86&#x2F;vpermb</a><p>So I guess you can just run VPERMB at the end on the calculated prefix-sum. The end.<p>-------<p>The Stream Compaction algorithm is a very important 1-dimentional work-balancing paradigm in the GPU programming world. It is used to select which rays are still active in a Raytracing scenario (so that all SIMD registers have something to do).
评论 #18841932 未加载
评论 #18836879 未加载
评论 #18838976 未加载
Const-me超过 6 年前
Interesting but his scalar code is slow. When you care about performance, better to implement such algorithm so it reads bytes one by one, but move blocks with memmove when switching from write to skip state.<p>Pathological case (skipping every other character) is slightly slower, but on real data it’s much faster overall.<p>Unfortunately I don’t have AVX512 hardware so I can’t test.
评论 #18838715 未加载
CountHackulus超过 6 年前
This is really neat, I wonder if there&#x27;s a way to keep a &quot;remainder&quot; around, kind of like Bresenham&#x27;s algorithm, so that you can always do aligned reads from memory.<p>The speedup on English text is really good, and I love the exploration into the AVX intrinsics.
评论 #18842976 未加载
lostmsu超过 6 年前
Instead of generating shuffle at runtime, couldn&#x27;t a table be used for shuffling lower and higher parts of the register separately, then merging the result?<p>Also, for uncommon patterns, the register could be split further to make the shuffle table fit into L1.<p>Also, I am not sure compiler can optimize that continue statement. The non-AVX version might be improved by removing the if alltogether, and replacing <i>dst++</i> with <i>dst =</i> followed by <i>dst += (src[i] == &#x27; &#x27; || src[i] == &#x27;\r&#x27; || src[i] == &#x27;\n&#x27;) ? 0 : 1</i>
评论 #18835804 未加载
评论 #18836864 未加载
terrycody超过 6 年前
what is the usage of these things?
pkaye超过 6 年前
What if you are working with unicode?
评论 #18837028 未加载