Removing characters from strings faster with AVX-512

146 points, posted by mdb31 about 3 years ago

11 comments

Andoryuuta, about 3 years ago
Intel is removing AVX-512 support from their newer CPUs (Alder Lake and later). :/

https://www.igorslab.de/en/intel-deactivated-avx-512-on-alder-lake-but-fully-questionable-interpretation-of-efficiency-news-editorial/
gslin, about 3 years ago
One problem is that the CPU frequency drops significantly when AVX-512 is involved (see, e.g., https://en.wikichip.org/wiki/intel/xeon_gold/6262v), which usually cancels out the benefit in the Real World (tm).
watmough, about 3 years ago
This is really cool.

I just got through doing some work with vectorization.

On the simplest workload I have, splitting a 3 MB text file into lines and writing a pointer to each string into an array, GCC will not vectorize the naive loop, though ICC might, I guess.

With simple vectorization to AVX512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 ms to 0.1 ms, so a little better than a 10x speedup, still on just the one core, which keeps things simple.

I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector Class Library. It's super easy.
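The line-splitting workload described in this comment can be modeled without any SIMD hardware. The sketch below is not watmough's code and does not use VCL: it is a portable scalar stand-in that processes the text in 64-byte blocks, builds a 64-bit mask of newline positions per block (the step an AVX-512 byte compare would do in one instruction), and then walks the set bits. The names `newline_mask` and `line_starts` are hypothetical, and it records offsets rather than pointers.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Scalar model of one 64-byte compare: bit k is set when block[k] == '\n'.
// `len` must be at most 64.
static std::uint64_t newline_mask(const char *block, std::size_t len) {
    std::uint64_t mask = 0;
    for (std::size_t k = 0; k < len; ++k) {
        if (block[k] == '\n') mask |= std::uint64_t{1} << k;
    }
    return mask;
}

// Returns the offset of the first byte of every line in `text`.
// A trailing newline does not start a new (empty) line.
std::vector<std::size_t> line_starts(const std::string &text) {
    std::vector<std::size_t> starts;
    if (text.empty()) return starts;
    starts.push_back(0);
    for (std::size_t base = 0; base < text.size(); base += 64) {
        std::size_t remaining = text.size() - base;
        std::size_t len = remaining < 64 ? remaining : 64;
        std::uint64_t mask = newline_mask(text.data() + base, len);
        while (mask != 0) {
            // Find the index of the lowest set bit (a count-trailing-zeros).
            std::size_t bit = 0;
            while (((mask >> bit) & 1) == 0) ++bit;
            std::size_t next = base + bit + 1;
            if (next < text.size()) starts.push_back(next);
            mask &= mask - 1;  // clear the lowest set bit
        }
    }
    return starts;
}
```

For the input `"ab\ncd\n\nef"` this yields the offsets 0, 3, 6 and 7. In a vectorized version only `newline_mask` would change; the bit-walking loop stays the same.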
mdb31, about 3 years ago
Cool performance enhancement, with an accompanying implementation in a real-world library (https://github.com/lemire/despacer).

Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making existing REPB prefixes a lot faster?
brrrrrm, about 3 years ago
What does the generated assembly look like? I suspect clang isn't smart enough to keep things in registers. The latency of VPCOMPRESSB seems quite high (according to the table at https://uops.info/table.html, at least), so you'll probably want to induce a bit more pipelining by manually unrolling into the register variant.

I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:

```cpp
// Fragment: needs <immintrin.h> and <cstddef>, plus AVX512BW and AVX512_VBMI2.
// `bytes` (char *) and `howmany` (size_t) come from the enclosing function.
__m512i spaces = _mm512_set1_epi8(' ');
size_t pos = 0;  // write offset of the compacted output
size_t i = 0;
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
  // 4 input regs, 4 output regs, you can actually do up to 8 because there are 8 mask registers
  __m512i in0 = _mm512_loadu_si512(bytes + i);
  __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
  __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
  __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
  // Signed compare: bit k is set when byte k is greater than ' ',
  // so bytes >= 0x80 compare as negative and are dropped as well.
  __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
  __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
  __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
  __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
  // Pack the kept bytes to the front of each register, zeroing the rest.
  auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
  auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
  auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
  auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
  // Store each full register, then advance by the number of kept bytes.
  _mm512_storeu_si512(bytes + pos, reg0);
  pos += _popcnt64(mask0);
  _mm512_storeu_si512(bytes + pos, reg1);
  pos += _popcnt64(mask1);
  _mm512_storeu_si512(bytes + pos, reg2);
  pos += _popcnt64(mask2);
  _mm512_storeu_si512(bytes + pos, reg3);
  pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
```

You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
bertr4nd, about 3 years ago
I love Daniel’s vectorized string processing posts. There’s always some clever trickery that’s hard for a guy like me (who mostly uses vector extensions for ML kernels) to get quickly.

I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.
gfody, about 3 years ago
There's more whitespace above 0x20: https://en.m.wikipedia.org/wiki/Whitespace_character#Unicode
GICodeWarrior, about 3 years ago
Here's a list of processors supporting AVX-512:

https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-InstructionSetExtensions=3533

The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has quite a good catalog.
tedunangst, about 3 years ago
What would be a practical application of this? The linked post mentions a trim-like operation, but in practice I only want to remove whitespace from the ends, not the interior of the string, and finding the ends is basically the whole problem. Or maybe I want to compress some JSON, but a simple approach won't work because there can be spaces inside string values which must be preserved.
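To illustrate the JSON point raised in this comment: removing whitespace from JSON requires knowing whether each byte sits inside a string literal, which a plain byte filter cannot tell. The sketch below is a scalar state machine, not a vectorized routine and not taken from the linked post. The function name `strip_json_whitespace` is hypothetical, and the input is assumed to be valid JSON.

```cpp
#include <string>

// Removes insignificant JSON whitespace (space, tab, LF, CR) while copying
// string literals verbatim. Assumes `json` is well-formed.
std::string strip_json_whitespace(const std::string &json) {
    std::string out;
    out.reserve(json.size());
    bool in_string = false;  // currently between an opening and closing quote
    bool escaped = false;    // previous character inside the string was '\'
    for (char c : json) {
        if (in_string) {
            out.push_back(c);  // everything inside a string is preserved
            if (escaped) {
                escaped = false;      // this character was escaped; no special meaning
            } else if (c == '\\') {
                escaped = true;       // next character is escaped
            } else if (c == '"') {
                in_string = false;    // unescaped quote closes the string
            }
        } else if (c == '"') {
            in_string = true;
            out.push_back(c);
        } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
            out.push_back(c);  // structural character or part of a literal
        }
    }
    return out;
}
```

The quote-tracking state is what makes this harder to vectorize than the despacing loop: whether a byte is kept depends on all the quotes and backslashes before it, not just on the byte itself.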
jquery, about 3 years ago
I prefer AMD's approach, which allows them to put more cores on the die instead of supporting a rarely used instruction set.
protoman3000, about 3 years ago
Please correct me if I'm wrong, but wouldn't we normally scale these things on a GPU instead?