I would do the SSE version slightly differently. No need for bitwise tricks, SSE2 has an instruction to compare bytes, which saves 1 instruction and one constant vector.<p><pre><code> int luhn( const char* s )
{
__m128i r = _mm_loadu_si128( ( const __m128i* )s );
// Decode ASCII
r = _mm_subs_epu8( r, _mm_set1_epi8( 0x30 ) );
// Double every other digit
__m128i m = _mm_set1_epi16( 0x00ff );
r = _mm_add_epi8( r, _mm_and_si128( r, m ) );
// if( digit > 9 ) digit -= 9
const __m128i nine = _mm_set1_epi8( 9 );
__m128i gt = _mm_cmpgt_epi8( r, nine );
r = _mm_sub_epi8( r, _mm_and_si128( gt, nine ) );
// Horizontal sum
r = _mm_sad_epu8( r, _mm_setzero_si128() );
r = _mm_add_epi64( r, _mm_srli_si128( r, 8 ) );
return _mm_cvtsi128_si32( r ) % 10;
}</code></pre>
This is so stupid, I love it!<p>Its actually a great demonstration of efficient parallel processing on something totally not needing it. But you've got your "prefix-sum" pattern (maybe called reduction-pattern in some other code, or even "scan" in yet other code).<p>The only problem is that there's only 16-digits, so parallel processing of this is barely an upgrade at all. But you can imagine that if we scaled credit-card numbers up to 1024 digits or 1-million digits (1 MB), this algorithm would scale very well.
I’d make it an inline function and call it multiple times per loop iteration. You should be able to pull some of the initializations out of the loop and fetch multiple at once, pipelining using different registers. Optimizing compilers have gotten good enough to do the right things much of the time with intrinsics but it still may benefit from hand writing.<p>Sometimes you can even eek a touch more speed with prefetch, but it can be tricky to get that right.<p>Another thing that matters a ton is whether you’re microbenchmarking it accurately. It isn’t trivial to do so, as you should mimic real memory / cache behavior as opposed to repeatedly having it fetch memory which is already cached and give you misleading results.