Intel is removing AVX-512 support from their newer CPUs (Alder Lake and later). :/<p><a href="https://www.igorslab.de/en/intel-deactivated-avx-512-on-alder-lake-but-fully-questionable-interpretation-of-efficiency-news-editorial/" rel="nofollow">https://www.igorslab.de/en/intel-deactivated-avx-512-on-alde...</a>
One problem is that the CPU frequency drops significantly when AVX-512 is in use (see e.g. <a href="https://en.wikichip.org/wiki/intel/xeon_gold/6262v" rel="nofollow">https://en.wikichip.org/wiki/intel/xeon_gold/6262v</a>), which usually cancels out the benefit in the Real World (tm).
This is really cool.<p>I just got through doing some work with vectorization.<p>On the simplest workload I have, splitting a 3 MByte text file into lines and writing a pointer to each line into an array, GCC will not vectorize the naive loop, though ICC might, I guess.<p>With simple vectorization to AVX512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 msec to 0.1 msec, so a little better than a 10x speedup, still just on the one core, which keeps things simple.<p>I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector Class Library. It's super easy.
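For reference, this is not the VCL code described above but a minimal sketch of the same idea with raw intrinsics. It assumes GCC or Clang (for `__builtin_ctzll`) and a build with AVX-512BW enabled (e.g. `-mavx512bw`); without that it just falls back to the scalar loop. The name `find_line_breaks` is made up for illustration, and it records offsets rather than pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__AVX512BW__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_USE_AVX512 1
#endif

// Hypothetical helper: returns the offset of every '\n' in data[0..len).
std::vector<std::size_t> find_line_breaks(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<std::size_t> breaks;
    std::size_t i = 0;
#if defined(SKETCH_USE_AVX512)
    const __m512i newline = _mm512_set1_epi8('\n');
    for (; i + 64 <= len; i += 64) {
        __m512i chunk = _mm512_loadu_si512(data + i);
        // One bit per input byte, set where the byte equals '\n'.
        std::uint64_t mask = _mm512_cmpeq_epi8_mask(chunk, newline);
        while (mask != 0) {
            // Index of the lowest set bit is the byte position in the chunk.
            breaks.push_back(i + static_cast<std::size_t>(__builtin_ctzll(mask)));
            mask &= mask - 1;  // clear the lowest set bit
        }
    }
#endif
    // Scalar tail (and the whole input when AVX-512BW is not enabled).
    for (; i < len; ++i) {
        if (data[i] == '\n') breaks.push_back(i);
    }
    return breaks;
}
```

The compare itself is one instruction per 64 bytes; the inner `while` only runs once per newline found, so it stays cheap as long as lines are not very short.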
Cool performance enhancement, with an accompanying implementation in a real-world library (<a href="https://github.com/lemire/despacer" rel="nofollow">https://github.com/lemire/despacer</a>).<p>Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making the existing REP-prefixed string instructions a lot faster?
What's the generated assembly look like? I suspect clang isn't smart enough to store things into registers. The latency of VPCOMPRESSB seems quite high (according to the table here at least <a href="https://uops.info/table.html" rel="nofollow">https://uops.info/table.html</a>), so you'll probably want to induce a bit more pipelining by manually unrolling into the register variant.<p>I don't have an AVX512 machine with VBMI2, but here's what my untested code might look like:<p><pre><code> __m512i spaces = _mm512_set1_epi8(' ');
size_t i = 0;
size_t pos = 0; // write offset into bytes (the output is compacted in place)
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
  // 4 input regs, 4 output regs, you can actually do up to 8 because there are 8 mask registers
  __m512i in0 = _mm512_loadu_si512(bytes + i);
  __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
  __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
  __m512i in3 = _mm512_loadu_si512(bytes + i + 192);
  // mask bit set for each byte greater than ' ' (signed compare)
  __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
  __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
  __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
  __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);
  // compress each input register with its own mask, staying in registers
  auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
  auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
  auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
  auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);
  // full 64-byte stores; only the first popcount(mask) bytes of each are kept
  _mm512_storeu_si512(bytes + pos, reg0);
  pos += _popcnt64(mask0);
  _mm512_storeu_si512(bytes + pos, reg1);
  pos += _popcnt64(mask1);
  _mm512_storeu_si512(bytes + pos, reg2);
  pos += _popcnt64(mask2);
  _mm512_storeu_si512(bytes + pos, reg3);
  pos += _popcnt64(mask3);
}
// old code can go here (continuing from i and pos), since it handles a smaller size well
</code></pre>
You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
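One way to read the chunking idea, as a rough sketch only: each chunk is compacted into its own slot of a temporary buffer, so the chunks do not depend on each other's output position, and the pieces are copied back together at the end. The kernel here is a plain scalar stand-in for the AVX-512 loop, and `despace_chunk`, `despace_chunked` and the `chunk` parameter are hypothetical names. Note it uses an unsigned compare, so bytes >= 0x80 are kept, unlike the signed `_mm512_cmpgt_epi8_mask`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Scalar stand-in for the vector kernel: copies the bytes of src[0..n) that
// are greater than ' ' into dst and returns how many were kept.
static std::size_t despace_chunk(const char* src, std::size_t n, char* dst) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (static_cast<unsigned char>(src[i]) > ' ') dst[pos++] = src[i];
    }
    return pos;
}

// Despaces bytes[0..howmany) in place and returns the new length.
std::size_t despace_chunked(char* bytes, std::size_t howmany, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<char> tmp(howmany);      // temporary memory, one slot per chunk
    std::vector<std::size_t> kept;       // bytes kept in each chunk
    for (std::size_t off = 0; off < howmany; off += chunk) {
        std::size_t n = (howmany - off < chunk) ? howmany - off : chunk;
        // Each chunk writes at its own offset, independent of the others.
        kept.push_back(despace_chunk(bytes + off, n, tmp.data() + off));
    }
    // Coalesce: copy each chunk's kept bytes back, packed end to end.
    std::size_t pos = 0;
    for (std::size_t k = 0; k < kept.size(); ++k) {
        std::memcpy(bytes + pos, tmp.data() + k * chunk, kept[k]);
        pos += kept[k];
    }
    return pos;
}
```

Whether the extra copy pays for itself would need measuring; the point is only that the per-chunk loop no longer carries a dependency on `pos`.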
I love Daniel’s vectorized string processing posts. There’s always some clever trickery that’s hard for a guy like me (who mostly uses vector extensions for ML kernels) to get quickly.<p>I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.
Here's a list of processors supporting AVX-512:<p><a href="https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-InstructionSetExtensions=3533" rel="nofollow">https://ark.intel.com/content/www/us/en/ark/search/featurefi...</a><p>The author mentions that it's difficult to identify which features are supported on which processor, but ark.intel.com has quite a good catalog.
What would be a practical application of this? The linked post mentions a trim-like operation, but in practice I only want to remove whitespace from the ends, not the interior of the string, and finding the ends is basically the whole problem. Or maybe I want to compress some JSON, but a simple approach won't work because there can be spaces inside string values, which must be preserved.
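To make the JSON point concrete, here is a scalar sketch of what "not simple" means: the stripper has to track whether it is inside a string literal (and whether the previous character was a backslash) before it may drop a space. `strip_json_whitespace` is a hypothetical name, and this does not validate the JSON in any way.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: removes whitespace that sits outside JSON string
// literals and copies everything else unchanged.
std::string strip_json_whitespace(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    bool in_string = false;  // currently between an opening and closing quote
    bool escaped = false;    // previous character inside the string was '\'
    for (char c : in) {
        if (in_string) {
            out.push_back(c);  // everything inside a string is preserved
            if (escaped) {
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                in_string = false;
            }
        } else if (c == '"') {
            in_string = true;
            out.push_back(c);
        } else if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
            out.push_back(c);
        }
    }
    assert(out.size() <= in.size());  // stripping never grows the text
    return out;
}
```

The per-byte state is what makes this awkward to express as a single compare-and-compress pass over 64 bytes.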