科技回声

11 comments

Andoryuuta, about 3 years ago

Intel is removing AVX-512 support from their newer CPUs (Alder Lake +). :/ https://www.igorslab.de/en/intel-deactivated-avx-512-on-alder-lake-but-fully-questionable-interpretation-of-efficiency-news-editorial/

gslin, about 3 years ago

One problem is that the CPU frequency drops significantly when AVX-512 is involved (see, e.g., https://en.wikichip.org/wiki/intel/xeon_gold/6262v), which usually cancels out the benefit in the Real World (tm).

watmough, about 3 years ago

This is really cool. I just got through doing some work with vectorization.

On the simplest workload I have, splitting a 3 MB text file into lines and writing a pointer to each string into an array, GCC will not vectorize the naive loop, though ICC might, I guess.

With simple vectorization to AVX-512 (64 unsigned chars in a vector), finding all the line breaks goes from 1.3 ms to 0.1 ms, so a little better than a 10x speedup, still on just one core, which keeps things simple.

I was using Agner Fog's VCL 2, the Apache-licensed C++ Vector Class Library. It's super easy.
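
For readers who want a point of comparison, here is a minimal scalar sketch of the line-splitting workload described above. It is not watmough's code: the function name `split_lines` is hypothetical, and it records each line as a `std::string_view` (pointer plus length) rather than a bare pointer. It leans on `std::memchr`, which C libraries often vectorize internally, so it may be a fairer baseline than a hand-written byte loop.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical baseline: record where each line of `text` starts and how long
// it is. A trailing newline does not produce an extra empty line.
std::vector<std::string_view> split_lines(std::string_view text) {
    std::vector<std::string_view> lines;
    const char* cur = text.data();
    const char* end = cur + text.size();
    while (cur < end) {
        // Find the next '\n' in the remaining bytes (nullptr if there is none).
        const void* hit =
            std::memchr(cur, '\n', static_cast<std::size_t>(end - cur));
        const char* nl = hit ? static_cast<const char*>(hit) : end;
        lines.emplace_back(cur, static_cast<std::size_t>(nl - cur));
        cur = (nl == end) ? end : nl + 1;  // step past the newline itself
    }
    return lines;
}
```

The returned views point into the original buffer, so the buffer has to outlive the vector.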

mdb31, about 3 years ago

Cool performance enhancement, with an accompanying implementation in a real-world library (https://github.com/lemire/despacer).

Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making the existing REP prefixes a lot faster?

brrrrrm, about 3 years ago

What does the generated assembly look like? I suspect clang isn't smart enough to keep things in registers. The latency of VPCOMPRESSB seems quite high (according to the table at https://uops.info/table.html, at least), so you'll probably want to induce a bit more pipelining by manually unrolling into the register variant.

I don't have an AVX-512 machine with VBMI2, but here's what my untested code might look like:

<pre><code>
__m512i spaces = _mm512_set1_epi8(' ');
size_t i = 0;
size_t pos = 0;  // write cursor into the same buffer (always <= i)
for (; i + (64 * 4 - 1) < howmany; i += 64 * 4) {
  // 4 input regs, 4 output regs, you can actually do up to 8
  // because there are 8 mask registers
  __m512i in0 = _mm512_loadu_si512(bytes + i);
  __m512i in1 = _mm512_loadu_si512(bytes + i + 64);
  __m512i in2 = _mm512_loadu_si512(bytes + i + 128);
  __m512i in3 = _mm512_loadu_si512(bytes + i + 192);

  // one bit per byte that is greater than ' ' (signed compare)
  __mmask64 mask0 = _mm512_cmpgt_epi8_mask(in0, spaces);
  __mmask64 mask1 = _mm512_cmpgt_epi8_mask(in1, spaces);
  __mmask64 mask2 = _mm512_cmpgt_epi8_mask(in2, spaces);
  __mmask64 mask3 = _mm512_cmpgt_epi8_mask(in3, spaces);

  // pack the kept bytes of each input to the front of a register
  auto reg0 = _mm512_maskz_compress_epi8(mask0, in0);
  auto reg1 = _mm512_maskz_compress_epi8(mask1, in1);
  auto reg2 = _mm512_maskz_compress_epi8(mask2, in2);
  auto reg3 = _mm512_maskz_compress_epi8(mask3, in3);

  // store full registers, but only advance by the number of kept bytes
  _mm512_storeu_si512(bytes + pos, reg0);
  pos += _popcnt64(mask0);
  _mm512_storeu_si512(bytes + pos, reg1);
  pos += _popcnt64(mask1);
  _mm512_storeu_si512(bytes + pos, reg2);
  pos += _popcnt64(mask2);
  _mm512_storeu_si512(bytes + pos, reg3);
  pos += _popcnt64(mask3);
}
// old code can go here, since it handles a smaller size well
</code></pre>

You can probably do better by chunking up the input and using temporary memory (coalesced at the end).
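
Since the snippet above is untested, one way to check it would be against a scalar model of the same compare-and-compress step. The sketch below is an assumption about what such a reference could look like; the name `despace_scalar_reference` is hypothetical and it is not part of the despacer library. It deliberately mirrors the signed byte comparison that `_mm512_cmpgt_epi8_mask` performs, which means bytes at or above 0x80 are dropped along with spaces and control characters.

```cpp
#include <cstddef>

// Hypothetical scalar reference: keep a byte only if it is greater than ' '
// when read as a SIGNED 8-bit value. Bytes >= 0x80 are negative under that
// reading, so they are removed too, matching a signed vector compare.
// Compacts in place and returns the new length.
std::size_t despace_scalar_reference(char* bytes, std::size_t howmany) {
    std::size_t pos = 0;
    for (std::size_t i = 0; i < howmany; ++i) {
        const signed char c = static_cast<signed char>(bytes[i]);
        if (c > ' ') {
            bytes[pos++] = bytes[i];
        }
    }
    return pos;  // bytes at and after pos are leftovers, not part of the result
}
```

Running both versions over the same random buffers and comparing the first `pos` bytes would catch indexing mistakes in the unrolled loop.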

bertr4nd, about 3 years ago

I love Daniel’s vectorized string processing posts. There’s always some clever trickery that’s hard for a guy like me (who mostly uses vector extensions for ML kernels) to get quickly.

I found myself wondering if one could create a domain-specific language for specifying string processing tasks, and then automate some of the tricks with a compiler (possibly with human-specified optimization annotations). Halide did this sort of thing for image processing (and ML via TVM to some extent) and it was a pretty significant success.

gfody, about 3 years ago

There's more whitespace above 0x20: https://en.m.wikipedia.org/wiki/Whitespace_character#Unicode
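
To make the point concrete, here is a small sketch of a predicate over decoded code points, using the characters that carry the Unicode `White_Space` property. The function name `is_unicode_whitespace` is hypothetical. Note that it works on code points, not bytes: in UTF-8 most of these are multi-byte sequences, so a single byte-wise comparison against 0x20 cannot detect them.

```cpp
// Hypothetical helper: true for code points with the Unicode White_Space
// property (ASCII whitespace plus NEL, NBSP, and the various Unicode spaces
// and line/paragraph separators).
constexpr bool is_unicode_whitespace(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) ||  // tab, LF, VT, FF, CR
           cp == 0x0020 ||                    // space
           cp == 0x0085 ||                    // next line (NEL)
           cp == 0x00A0 ||                    // no-break space
           cp == 0x1680 ||                    // ogham space mark
           (cp >= 0x2000 && cp <= 0x200A) ||  // en quad .. hair space
           cp == 0x2028 || cp == 0x2029 ||    // line / paragraph separator
           cp == 0x202F || cp == 0x205F ||    // narrow NBSP, medium math space
           cp == 0x3000;                      // ideographic space
}
```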

GICodeWarrior, about 3 years ago

Here's a list of processors supporting AVX-512: https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&1_Filter-InstructionSetExtensions=3533

The author mentions it's difficult to identify which features are supported on which processor, but ark.intel.com has quite a good catalog.

tedunangst, about 3 years ago

What would be a practical application of this? The linked post mentions a trim-like operation, but in practice I only want to remove whitespace from the ends, not the interior of the string, and finding the ends is basically the whole problem. Or maybe I want to compress some JSON, but a simple approach won't work because there can be spaces inside string values which must be preserved.
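
For contrast with the despacing kernel, a trim that only touches the ends can be sketched as below. This is an illustration under assumptions, not code from the linked post: the name `trim_ascii` is hypothetical, and "whitespace" here means any byte less than or equal to ' ', the same ASCII threshold the thread discusses. The work is just two scans from either end, which is the "finding the ends" step mentioned above.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: drop bytes <= ' ' from both ends of `s` and leave the
// interior untouched. Returns a view into the original storage.
std::string_view trim_ascii(std::string_view s) {
    std::size_t first = 0;
    while (first < s.size() && static_cast<unsigned char>(s[first]) <= ' ') {
        ++first;  // skip leading whitespace
    }
    std::size_t last = s.size();
    while (last > first && static_cast<unsigned char>(s[last - 1]) <= ' ') {
        --last;  // skip trailing whitespace
    }
    return s.substr(first, last - first);
}
```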

jquery, about 3 years ago

I prefer AMD's approach, which allows them to put more cores on the die instead of supporting a rarely used instruction set.

protoman3000, about 3 years ago

Please correct me if I'm wrong, but wouldn't we normally scale these things on a GPU instead?

Removing characters from strings faster with AVX-512
