OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have six 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So in mask 0, bit n is set if index n has bit (1 << 0) set, mask 1 is the same but for bit (1 << 1), etc. The masks have a simple pattern: mask i alternates between runs of 0 bits and runs of 1 bits every (1 << i) bits. So for a 3-bit index (an 8-byte vector) the masks would be: 10101010, 11001100, 11110000.<p>Each of these masks is then run through PEXT, using the bitmask of non-whitespace bytes as the selector. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, with the indices of whitespace bytes dropped.<p>I wasn't actually able to run this code, since I don't have an AVX-512 machine, but I'm pretty sure it should be faster. I put the code on GitHub if anyone wants to try: <a href="https://github.com/zwegner/toys/blob/master/avx512-remove-spaces/avx512vbmi.cpp" rel="nofollow">https://github.com/zwegner/toys/blob/master/avx512-remove-sp...</a><p><pre><code> const uint64_t index_masks[6] = {
0xaaaaaaaaaaaaaaaa,
0xcccccccccccccccc,
0xf0f0f0f0f0f0f0f0,
0xff00ff00ff00ff00,
0xffff0000ffff0000,
0xffffffff00000000,
};
const __m512i index_bits[6] = {
_mm512_set1_epi8(1),
_mm512_set1_epi8(2),
_mm512_set1_epi8(4),
_mm512_set1_epi8(8),
_mm512_set1_epi8(16),
_mm512_set1_epi8(32),
};
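// ...later, inside the loop:
// Invert the whitespace mask so that set bits mark the bytes to keep.
mask = ~mask;
// Start every index at 0; bit i of each index is added in pass i below.
__m512i indices = _mm512_set1_epi8(0);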
for (size_t index = 0; index < 6; index++) {
uint64_t m = _pext_u64(index_masks[index], mask);
indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
}
output = _mm512_permutexvar_epi8(indices, input);</code></pre>
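<p>Since the intrinsics version needs AVX-512 VBMI hardware, here is a rough scalar sketch of the same index construction that should run anywhere. It is only an emulation for checking the logic, not the code from the repo: pext_soft is a portable stand-in for _pext_u64, remove_spaces_block is a hypothetical helper name, it handles a single block of at most 64 bytes, and it treats only ' ' as whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Portable stand-in for the BMI2 PEXT instruction (assumption: same semantics
// as _pext_u64). Gathers the bits of `src` selected by `mask` and packs them
// into the low bits of the result.
static uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; mask &= mask - 1, out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
        if (src & lowest) result |= out_bit;
    }
    return result;
}

// Hypothetical helper: removes ' ' from one block of up to 64 bytes by
// building the shuffle indices bit by bit, as in the vector code.
std::string remove_spaces_block(const std::string &input) {
    assert(input.size() <= 64);
    static const uint64_t index_masks[6] = {
        0xaaaaaaaaaaaaaaaaULL, 0xccccccccccccccccULL, 0xf0f0f0f0f0f0f0f0ULL,
        0xff00ff00ff00ff00ULL, 0xffff0000ffff0000ULL, 0xffffffff00000000ULL,
    };

    // Bit i of `keep` is set when input[i] is not a space.
    uint64_t keep = 0;
    size_t kept = 0;
    for (size_t i = 0; i < input.size(); i++) {
        if (input[i] != ' ') {
            keep |= uint64_t(1) << i;
            kept++;
        }
    }

    // Pass `bit` contributes bit `bit` of every surviving byte's index.
    // Lane j of `indices` ends up holding the index of the j-th kept byte.
    uint8_t indices[64] = {0};
    for (size_t bit = 0; bit < 6; bit++) {
        uint64_t m = pext_soft(index_masks[bit], keep);
        for (size_t lane = 0; lane < 64; lane++) {
            // Scalar equivalent of the masked byte add in the vector loop.
            if ((m >> lane) & 1)
                indices[lane] = static_cast<uint8_t>(indices[lane] | (1u << bit));
        }
    }

    // Scalar equivalent of the byte permute: output lane j = input[indices[j]].
    std::string out;
    for (size_t lane = 0; lane < kept; lane++)
        out.push_back(input[indices[lane]]);
    return out;
}
```

<p>For the input "a b  c" the kept positions are 0, 2 and 5, so the three passes that matter produce PEXT results 100, 010 and 100 (binary), and the first three index lanes come out as 0, 2 and 5.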