For those interested, llvm-mca reports the following:<p><pre><code> Iterations: 10000
Instructions: 100000
Total Cycles: 25011
Total uOps: 100000
Dispatch Width: 4
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 2.5
No resource or data dependency bottlenecks discovered.
</code></pre>
I read that as about 2.5 cycles per iteration on Zen 3 (25011 total cycles over 10000 iterations).
Tiger Lake is a bit worse, at about 3 cycles per iteration, apparently because it runs more uOps per iteration.<p>This is for the following loop body (extracted from the output of `clang -O3 -march=znver3`, using trunk at 5a8d5a2859d9bb056083b343588a2d87622e76a2):<p><pre><code> .LBB5_2: # =>This Inner Loop Header: Depth=1
mov rdx, r11
add r11, r8
mulx rdx, rax, r9
xor rdx, rax
mulx rdx, rax, r10
xor rdx, rax
mov qword ptr [rdi + 8*rcx], rdx
add rcx, 2
cmp rcx, rsi
jb .LBB5_2</code></pre>
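<p>The 2.5 cycles per iteration figure can be cross-checked from the summary numbers alone. A minimal sketch, assuming dispatch width is the only limit (which is what the "no bottlenecks" line suggests, but it is my assumption); the variable names are mine, the values are copied from the Zen 3 report:

```python
# Figures copied from the llvm-mca summary for the Zen 3 run.
iterations = 10_000
instructions = 100_000
total_cycles = 25_011
total_uops = 100_000
dispatch_width = 4

uops_per_iteration = total_uops / iterations      # uOps in one pass of the loop body
cycles_per_iteration = total_cycles / iterations  # what the simulation observed
ipc = instructions / total_cycles

# Assumption: dispatch is the only constraint, so the floor is uOps / width.
dispatch_bound = uops_per_iteration / dispatch_width

print(f"uOps/iteration:   {uops_per_iteration:.0f}")
print(f"cycles/iteration: {cycles_per_iteration:.2f}")
print(f"IPC:              {ipc:.2f}")
print(f"dispatch bound:   {dispatch_bound:.2f} cycles/iteration")
```

The simulated cycles per iteration land on the dispatch bound, which would be consistent with the loop being limited by the 4-wide front end rather than by the dependency chain through the two mulx instructions, at least as far as the model is concerned.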