Did anyone actually look at the machine code generated here? 0.30ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle <i>on average</i>. 1 cycle <i>on average</i> for any kind of loop is just flat out suspicious.<p>This requires a lot more digging to understand.<p>Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
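To put rough numbers on that suspicion (assuming the M1's performance cores run at about 3.2 GHz and a 64-bit multiply has roughly 3 cycles of latency, per publicly available instruction tables): 0.30 ns per value is indeed about 1 cycle per value, but the dependency chain inside a single value is two back-to-back multiplies plus xors, i.e. on the order of 8 cycles. Reaching ~1 cycle per value therefore needs something like 8 independent values in flight at once, which is exactly the overlap that loop unrolling or deep out-of-order speculation across iterations would provide.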
RISC-V does this too: <a href="https://five-embeddev.com/riscv-isa-manual/latest/m.html" rel="nofollow">https://five-embeddev.com/riscv-isa-manual/latest/m.html</a><p>"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
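For reference, here is a minimal C sketch (assuming a compiler with __uint128_t support, such as GCC or Clang on a 64-bit target) of the pattern that typically lowers to exactly such an adjacent high/low multiply pair with identical source operands (mulhu/mul on RV64, umulh/mul on AArch64), which is what makes the fusion opportunity visible to the microarchitecture:<p><pre><code> #include &lt;stdint.h&gt;

 // Full 64x64 -> 128-bit product; on RV64 this typically compiles to
 // mulhu + mul, and on AArch64 to umulh + mul, with the same source operands.
 static inline void mul64x64(uint64_t a, uint64_t b,
                             uint64_t *hi, uint64_t *lo) {
   __uint128_t p = (__uint128_t)a * b;
   *hi = (uint64_t)(p >> 64);  // high half (mulhu / umulh)
   *lo = (uint64_t)p;          // low half  (mul)
 }
</code></pre>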
For the interested, LLVM-MCA says this<p><pre><code> Iterations: 10000
Instructions: 100000
Total Cycles: 25011
Total uOps: 100000
Dispatch Width: 4
uOps Per Cycle: 4.00
IPC: 4.00
Block RThroughput: 2.5
No resource or data dependency bottlenecks discovered.
</code></pre>
To me that looks like about 2.5 cycles per iteration (on Zen3).
Tigerlake is a bit worse, at about 3 cycles per iteration, due to running more uOPs per iteration, by the looks of it.<p>For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):<p><pre><code> .LBB5_2: # =>This Inner Loop Header: Depth=1
mov rdx, r11
add r11, r8
mulx rdx, rax, r9
xor rdx, rax
mulx rdx, rax, r10
xor rdx, rax
mov qword ptr [rdi + 8*rcx], rdx
add rcx, 2
cmp rcx, rsi
jb .LBB5_2</code></pre>
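For context, here is a rough C reconstruction of what that loop body computes (my own sketch, not the article's exact source, and it ignores that the compiled loop advances its store index by two per iteration): each output is a per-index seed pushed through two rounds of 64x64-to-128-bit multiply followed by an xor-fold of the high and low halves, which is where the two mulx/xor pairs come from.<p><pre><code> #include &lt;stddef.h&gt;
 #include &lt;stdint.h&gt;

 // One round: full 128-bit product, then fold the high half into the low half.
 static inline uint64_t mix(uint64_t x, uint64_t k) {
   __uint128_t p = (__uint128_t)x * k;
   return (uint64_t)p ^ (uint64_t)(p >> 64);
 }

 // Benchmark-style loop: each seed comes from a simple addition, not from the
 // previous output, so iterations are independent of each other.
 void fill(uint64_t *out, size_t n, uint64_t seed, uint64_t step,
           uint64_t k1, uint64_t k2) {
   for (size_t i = 0; i < n; i++) {
     out[i] = mix(mix(seed, k1), k2);
     seed += step;
   }
 }
</code></pre><p>Because successive seeds do not depend on the previous output, the out-of-order core can overlap many iterations' multiply chains, which is consistent with the ~2.5-cycle throughput estimate above.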
The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.<p>Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?<p>How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it is something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh comes along before something overwrites that cache, it can use it.)
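One way to poke at that (a rough AArch64-only sketch using GCC/Clang extended inline asm; the timing harness is left out and the function names are made up) is to compare the throughput of two otherwise identical sequences that differ only in instruction order:<p><pre><code> #include &lt;stdint.h&gt;

 // Variant A: mul first, then umulh (same operands, adjacent).
 static inline uint64_t mul_then_mulh(uint64_t a, uint64_t b) {
   uint64_t lo, hi;
   __asm__("mul   %0, %2, %3\n\t"
           "umulh %1, %2, %3"
           : "=&r"(lo), "=&r"(hi)
           : "r"(a), "r"(b));
   return lo ^ hi;
 }

 // Variant B: umulh first, then mul.
 static inline uint64_t mulh_then_mul(uint64_t a, uint64_t b) {
   uint64_t lo, hi;
   __asm__("umulh %1, %2, %3\n\t"
           "mul   %0, %2, %3"
           : "=&r"(lo), "=&r"(hi)
           : "r"(a), "r"(b));
   return lo ^ hi;
 }
</code></pre><p>Timing each in a loop would show whether the ordering matters, and a third variant with an unrelated instruction wedged between the two would help distinguish "recognizes the adjacent pair" from "caches the full 128-bit product somewhere".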
I love my M1, but does anyone else get horrific performance when it wakes from sleep? It’s like it swaps everything to disk and takes a full minute to come back to life.
So this is why the integer multiply-accumulate instruction, mullah, only delivers the most significant bits? Ironic if you aren't religious about these things.
I believe that the ARMv8 NEON crypto extensions have a special instruction for a 64-bit multiply producing a 128-bit product, which is useful for Monero mining, for example.
That's great if your app is compute bound. "May all your processes be compute bound." Back in the real world, most of the time your process will be I/O bound. I think that's the real innovation of the M1 chip.