Regarding "A final optimization", there's another way:<p>`_mm256_cmpeq_epi32` produces `0xFFFFFFFF` in each matching lane; the article suggests shifting that to produce a `1` and then adding.<p>Instead, you can interpret `0xFFFFFFFF` as negative one and subtract. That saves a shift.<p>In other words, flip the sign when accumulating.<p>In general I think this is a pretty common counting trick. I don't think those shift operations even exist for epi8, so there you really need to use it to avoid reduction to a narrow register. Also, in the case of epi8 you need to deal with overflow, since an 8-bit lane wraps around after 255 matches, so the pattern is like this in pseudocode:<p><pre><code> v[1:32] = 0
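 total = 0
 for j = 0 to N / 255 - 1
   for i = 0 to 254            // at most 255 subtractions, so no 8-bit lane wraps
     v[1:32] -= cmpeq(..., ...)
   end
   total += sum(v)
   v[1:32] = 0                 // reset the lanes before the next block
 end</code></pre>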
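<p>For what it's worth, here is a minimal C sketch of that epi8 pattern, assuming x86 with GCC or Clang. The function names (`count_eq_u8`, `count_eq_u8_avx2`, `count_eq_u8_scalar`) and the byte-counting task are made up for illustration and are not from the article. It uses `_mm256_sad_epu8` against zero for the `sum(v)` step, which is one possible way to widen the 8-bit lanes, not necessarily the best one.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Plain reference implementation, also used as the fallback. */
size_t count_eq_u8_scalar(const uint8_t *data, size_t len, uint8_t needle)
{
    size_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += (data[i] == needle);
    return total;
}

/* AVX2 version: subtract the 0xFF (i.e. -1) compare mask from 8-bit counters. */
__attribute__((target("avx2")))
size_t count_eq_u8_avx2(const uint8_t *data, size_t len, uint8_t needle)
{
    const __m256i target = _mm256_set1_epi8((char)needle);
    const __m256i zero = _mm256_setzero_si256();
    size_t total = 0;
    size_t i = 0;

    while (len - i >= 32) {
        /* Process at most 255 vectors per block so no 8-bit lane can wrap. */
        size_t vecs = (len - i) / 32;
        if (vecs > 255)
            vecs = 255;
        assert(vecs >= 1 && vecs <= 255);

        __m256i acc = zero;
        for (size_t k = 0; k < vecs; k++, i += 32) {
            __m256i chunk = _mm256_loadu_si256((const __m256i *)(data + i));
            /* cmpeq yields 0xFF (-1) per matching byte; acc - (-1) == acc + 1 */
            acc = _mm256_sub_epi8(acc, _mm256_cmpeq_epi8(chunk, target));
        }

        /* Sum of absolute differences against zero adds each group of
           8 byte counters into one 64-bit lane (four lanes in total). */
        __m256i sums = _mm256_sad_epu8(acc, zero);
        uint64_t lanes[4];
        _mm256_storeu_si256((__m256i *)lanes, sums);
        total += (size_t)(lanes[0] + lanes[1] + lanes[2] + lanes[3]);
    }

    /* Leftover bytes that do not fill a whole vector. */
    return total + count_eq_u8_scalar(data + i, len - i, needle);
}

/* Dispatcher: use AVX2 only when the CPU reports support for it. */
size_t count_eq_u8(const uint8_t *data, size_t len, uint8_t needle)
{
    if (__builtin_cpu_supports("avx2"))
        return count_eq_u8_avx2(data, len, needle);
    return count_eq_u8_scalar(data, len, needle);
}
```

<p>The block size of 255 is the whole point: each 8-bit counter gains at most one per vector, so 255 vectors is the most you can accumulate before widening.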