TechEcho
Luhn algorithm using SWAR and SIMD

19 points by harporoeder about 3 years ago

3 comments

Const-me about 3 years ago
I would do the SSE version slightly differently. There's no need for bitwise tricks: SSE2 has an instruction to compare bytes, which saves one instruction and one constant vector.

```c
int luhn( const char* s )
{
    __m128i r = _mm_loadu_si128( ( const __m128i* )s );
    // Decode ASCII
    r = _mm_subs_epu8( r, _mm_set1_epi8( 0x30 ) );
    // Double every other digit
    __m128i m = _mm_set1_epi16( 0x00ff );
    r = _mm_add_epi8( r, _mm_and_si128( r, m ) );
    // if( digit > 9 ) digit -= 9
    const __m128i nine = _mm_set1_epi8( 9 );
    __m128i gt = _mm_cmpgt_epi8( r, nine );
    r = _mm_sub_epi8( r, _mm_and_si128( gt, nine ) );
    // Horizontal sum
    r = _mm_sad_epu8( r, _mm_setzero_si128() );
    r = _mm_add_epi64( r, _mm_srli_si128( r, 8 ) );
    return _mm_cvtsi128_si32( r ) % 10;
}
```
dragontamer about 3 years ago
This is so stupid, I love it!

It's actually a great demonstration of efficient parallel processing on something that totally doesn't need it. You've got your "prefix-sum" pattern (maybe called the reduction pattern in some other code, or even "scan" in yet other code).

The only problem is that there are only 16 digits, so parallel processing is barely an upgrade at all. But you can imagine that if we scaled credit-card numbers up to 1024 digits, or 1 million digits (1 MB), this algorithm would scale very well.
emerged about 3 years ago
I'd make it an inline function and call it multiple times per loop iteration. You should be able to hoist some of the initializations out of the loop and fetch multiple blocks at once, pipelining using different registers. Optimizing compilers have gotten good enough to do the right thing with intrinsics much of the time, but it may still benefit from hand-writing.

Sometimes you can even eke out a touch more speed with prefetch, but it can be tricky to get right.

Another thing that matters a ton is whether you're microbenchmarking it accurately. That isn't trivial: you should mimic real memory / cache behavior, rather than repeatedly fetching memory that is already cached and getting misleading results.