It's interesting that ARM32 has conditional execution, and I like it a lot for writing readable assembly code. A short jump that results from a simple if can be encoded as three successive instructions with no branches.<p>However, it's now falling out of favor (mostly gone from ARM64), apparently because of the relative cost of putting conditional execution on the die vs. relying on smarter compilers.
One simple branchless optimization I've used is for collision detection across an array of values: instead of testing each one, I add its value to a counter (perhaps with some mapping from data to collision value). After iterating over many of them, I can do just one test. This is very CPU-friendly, as the pipeline gets to crunch all the numbers in one go.
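A minimal sketch of that accumulate-then-test idea in C. The names `collision_value`, `count_collisions` and `any_collision` are made up for illustration, and treating any nonzero cell as a collision is an assumed mapping, not something from the comment above. The loop still has its own loop-counter branch, but that one is usually well predicted; what goes away is the data-dependent test per element.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping from data to collision value:
 * a cell counts as a collision when it is nonzero. */
static inline uint32_t collision_value(uint8_t cell)
{
    return cell != 0;   /* yields 0 or 1; no if/else on the data */
}

/* Add up the collision values of all cells without testing each one. */
uint32_t count_collisions(const uint8_t *cells, size_t n)
{
    uint32_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += collision_value(cells[i]);
    return hits;
}

/* The single test, done once after the whole array has been processed. */
int any_collision(const uint8_t *cells, size_t n)
{
    return count_collisions(cells, n) != 0;
}
```

The trade-off is that the loop never exits early, so this only pays off when collisions are rare or the arrays are short enough that scanning everything is cheap.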
One example seems a bit odd.<p><pre><code>    if (LocalVariable & 0x00001000)
        return 1;
    else
        return 0;

    mov eax, [ebp - 10]
    and eax, 0x00001000
    neg eax
    sbb eax, eax
    neg eax
    ret
</code></pre>
Hmm... wouldn't this be faster? Two fewer instructions:<p><pre><code>    mov eax, [ebp - 10]
    and eax, 0x00001000
    shr eax, 12
    ret
</code></pre>
Well, who knows. I didn't bother to analyze this case. Maybe the article's example is faster somehow?
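For what it's worth, the two sequences can be modeled in C to check that they return the same value for this mask. This is only a sketch: the function names are made up, and the carry flag is simulated with an ordinary variable. It says nothing about which version is faster on real hardware.

```c
#include <stdint.h>

/* Models the article's neg / sbb / neg sequence. */
uint32_t flag_via_sbb(uint32_t x)
{
    uint32_t eax = x & 0x00001000u;   /* and eax, 0x00001000 */
    uint32_t carry = (eax != 0);      /* neg sets CF when its operand is nonzero */
    eax = 0u - eax;                   /* neg eax */
    eax = eax - eax - carry;          /* sbb eax, eax: 0 or 0xFFFFFFFF */
    eax = 0u - eax;                   /* neg eax: 0 or 1 */
    return eax;
}

/* Models the shorter and / shr sequence. */
uint32_t flag_via_shift(uint32_t x)
{
    return (x & 0x00001000u) >> 12;   /* and eax, 0x00001000 ; shr eax, 12 */
}
```

One difference worth noting: the neg/sbb/neg form turns any nonzero value into 1, so it also works when the mask has several bits set, while the shift only works for a single-bit mask. That may be why a generic "convert to 0 or 1" pattern shows up even where a shift would do.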
Nice article. It inspired me to look around for a more straightforward way of optimizing, and I found the setcc class of instructions: <a href="http://www.nynaeve.net/?p=178" rel="nofollow">http://www.nynaeve.net/?p=178</a><p>I'm thinking that this combined with some CAS (CMPXCHG8B) could achieve the same, right?<p>Something like (pseudo):<p>Comparewith(4)<p>Ifequalstore(54)<p>Ifnotequalstore(2)<p>Return
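A rough C rendering of that pseudocode, assuming the intent is "return 54 if the value equals 4, otherwise 2". The function name is hypothetical. A comparison result used as a 0/1 integer is the kind of expression compilers commonly lower to cmp plus setcc (or cmov) on x86, though the exact output depends on the compiler and flags.

```c
#include <stdint.h>

/* Returns 54 when x == 4 and 2 otherwise, with no if/else in the source. */
int32_t select_no_branch(int32_t x)
{
    int32_t eq = (x == 4);        /* 0 or 1 */
    return 2 + eq * (54 - 2);     /* 2 when eq == 0, 54 when eq == 1 */
}
```

As far as I can tell, a compare-and-swap is not needed for this: CMPXCHG8B is an atomic compare-and-exchange on a 64-bit memory operand, which addresses a different problem (concurrent updates) than picking between two values in a register.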
I know that gcc and clang both have __builtin_expect(). If you tell the compiler the more likely path, wouldn't that make the branching version faster?<p>Actually, I've always wondered how __builtin_expect translates to something the CPU's branch prediction engine can use...
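A small sketch of how __builtin_expect is typically used. The `LIKELY`/`UNLIKELY` macro names and the `sum_checked` function are illustrative assumptions. As far as I know, the hint mostly influences how the compiler lays out the code (keeping the likely path as the straight-line fall-through) rather than being passed to the CPU's branch predictor, which learns from runtime behavior on its own.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)    /* fallback: no hint on other compilers */
#define UNLIKELY(x) (x)
#endif

/* Hypothetical example: sum values, treating a negative entry as a rare error.
 * On error, *error is set to 1 and the partial sum so far is returned. */
long sum_checked(const int *v, size_t n, int *error)
{
    long total = 0;
    *error = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(v[i] < 0)) {   /* marked as the cold path */
            *error = 1;
            return total;
        }
        total += v[i];
    }
    return total;
}
```

Whether this beats a branchless version depends on how predictable the branch really is; the hint does not remove the branch, it only tells the compiler which side to favor.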