> mov instructions from register to register make up for more than 60% of the time spent in the critical section of the code, while we would expect most of the time to be spent xoring and anding. I have not investigated why this is the case, ideas welcome

If you're not using precise events, the instruction addresses reported by perf will have some skid: a small CPU delay between the moment a performance counter overflows and the moment the interrupt actually freezes state. In practice this means a sample often gets attributed to an instruction a few cycles after the one that actually caused the event.

You can request precise sampling for some events, depending on the CPU. Try "-e cycles:pp", for instance. Looking at your annotated output:

<pre><code> 0,09 │ mov (%rax,%r8,4),%eax
29,32 │ mov %r14,%r8
</code></pre>
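Concretely, re-recording with a precise event and re-annotating looks something like this (a sketch: ":pp" support depends on your CPU's PMU, and "./prog" stands in for your actual binary):

<pre><code> # record with a precise cycles event, then view per-instruction costs
 perf record -e cycles:pp ./prog
 perf annotate --stdio
</code></pre>
With precise samples, the hit counts should land on (or much closer to) the instructions that actually incurred them.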
Given the skid, the 29,32% attributed to the reg-reg mov most likely belongs to the instruction just before it. I think that first mov, the load from memory, is your true cycle eater, much more so than the reg-reg mov or any single xor/and operation. But don't optimize based on my hunch - measure it precisely first! If memory access proves to be your slowdown, then you can try optimizing your access patterns.
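To illustrate what "optimizing your access patterns" can buy you, here is a minimal, self-contained C sketch (not your code; the array size and stride are made up for the demo). The strided walk touches a fresh cache line on almost every load, while the sequential walk performs exactly the same loads and xors but lets the hardware prefetcher stream the data in:

<pre><code>#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 24)   /* 16M uint32_t = 64 MB, far bigger than cache */
#define STRIDE 16      /* one 32-bit word per 64-byte cache line */

int main(void)
{
    uint32_t *a = malloc((size_t)N * sizeof *a);
    if (!a)
        return 1;
    for (size_t i = 0; i < N; i++)
        a[i] = (uint32_t)i;

    /* Strided walk: each load hits a new cache line, so most of them
       miss. In perf annotate, the miss cost tends to show up (with
       skid) on the instruction right after the load. */
    uint32_t strided = 0;
    for (size_t s = 0; s < STRIDE; s++)
        for (size_t i = s; i < N; i += STRIDE)
            strided ^= a[i];

    /* Sequential walk: identical loads and xors, but the prefetcher
       can keep up, so the loads rarely stall. */
    uint32_t sequential = 0;
    for (size_t i = 0; i < N; i++)
        sequential ^= a[i];

    /* Print both so the compiler can't delete either loop; the values
       are equal because xor is order-independent. */
    printf("%u %u\n", (unsigned)strided, (unsigned)sequential);
    free(a);
    return 0;
}
</code></pre>
Compile with optimizations and run it under "perf stat -e cycles,cache-misses" (or sample it with cycles:pp as above): the strided loop should dominate. That is the kind of signal to look for before restructuring anything.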
I wonder how many veteran C programmers (myself not included) would react with "d'oh, of course you should byte-align memory access" here...