This is a very artificial sort of task. You have to visit all bits down to the least significant 1 bit, and ignore the rest? In designing the system, since the assignment of bits to meaning is probably arbitrary, it is smarter to assign them in whatever way is most convenient and least error-prone to code against, arranging things so the 1 bits can be visited least-significant first (a sketch of that loop appears below, after the code).<p>But, OK.<p>(It is probably important, here, to call more attention to the fact that shifting a <i>signed</i> value left until its sign changes would be Undefined Behavior, making your optimizer likely to do something crazy. That is why the code sample must cast a shifted uint64_t to an int64_t before comparing to zero; that conversion is well-defined. Reader be advised!)<p>In the final, indexing example, the code the compiler generates shifts the word right 63 places to form the array index, instead of checking the sign. You might as well do that in the source code (also sketched below), instead of coding a check of the sign bit that the compiler will convert to what you really meant anyway.<p>The final trick of indexing an array to get the right function to call is a good one, and could be a lot faster than a branch if the bit patterns are close to random. Branch predictors are very good at teasing out patterns, but the penalty for guessing wrong is very high. A possibly better way, particularly on clang, would look more like<p><pre><code> (mask >> 63 ? func_true : func_false)(i);
</code></pre>
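(To make that first point concrete: if the bits were assigned so they could be consumed least-significant first, the loop could use the standard clear-lowest-set-bit idiom. This is only a sketch; handle_bit is a made-up name, and __builtin_ctzll is the GCC/Clang builtin for counting trailing zeros.)<p><pre><code> void loop_lsb_first(uint64_t mask) {
   while (mask) {
     int i = __builtin_ctzll(mask);  // index of the lowest set bit
     handle_bit(i);                  // hypothetical per-bit handler
     mask &= mask - 1;               // clear that bit and continue
   }
 }
</code></pre>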
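(And for contrast, the array-indexing version with the shift written out in the source would presumably look something like this; the table name is my own invention.)<p><pre><code> void loop_indexed(uint64_t mask) {
   static void (*const dispatch[2])(int) = { func_false, func_true };
   for (int i = 63; mask; mask <<= 1, --i)
     dispatch[mask >> 63](i);  // top bit selects which function to call
 }
</code></pre>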
Modern chips have a "cmov" instruction that is likely to be better than indexing into an array, with its three-cycle trip out to L1 cache: on a modern chip, that is enough time to execute 18 other instructions!<p>In the loop, there is no point in comparing i to zero, because you are going to break out when the mask hits zero anyway. So, that is just wasted code; deleting it would eliminate the useless "dec %ebp; js 108" instructions. That branch is reliably predicted, so it mostly just costs cache footprint, not cycles. But the micro-operations cache is very small, and if the extra instructions make you blow your micro-op cache, that could be pretty expensive.<p>Finally, and <i>most important</i>: the number of instructions is an extremely poor way to estimate performance. Often, many instructions take less time to run than a few different ones. So, to make a smart choice about performance, or to present <i>actually meaningful</i> results in your blog post, you need to set up a micro-benchmark and actually see what really happens, not just guess by counting lines (a minimal harness is sketched at the end of this comment). There are extremely good micro-benchmark tools available free online. Google's is as good as any. Or you could paste the code into Godbolt.org and benchmark it there.<p>So, in the end, this <i>might</i> be best:<p><pre><code> void loop_v4(uint64_t mask) {
   for (int i=63; mask; mask <<= 1, --i)
     (mask >> 63 ? func_true : func_false)(i);
 }
</code></pre>
or possibly<p><pre><code> (mask & ~(-1ull>>1) ? func_true : func_false)(i);
</code></pre>
or<p><pre><code> ((int64_t)mask < 0 ? func_true : func_false)(i);
</code></pre>
But again, you need to measure (which I didn't do either). Results might be very, very different between x86 and ARM. Of course the results will depend on what happens in func_true and in func_false, too. And, you might find this is altogether the wrong place to spend your optimizing attention.
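(To make the measuring part easier to start: a minimal harness for Google's benchmark library might look roughly like the sketch below. It assumes loop_v4, func_true, and func_false from above are defined in the same file, and the mask value is just an arbitrary test pattern.)<p><pre><code> #include <benchmark/benchmark.h>
 #include <cstdint>

 static void BM_loop_v4(benchmark::State& state) {
   uint64_t mask = 0xDEADBEEFCAFEBABEull;  // arbitrary bit pattern
   for (auto _ : state) {
     benchmark::DoNotOptimize(mask);       // keep mask from being constant-folded away
     loop_v4(mask);
   }
 }
 BENCHMARK(BM_loop_v4);
 BENCHMARK_MAIN();
</code></pre>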