
Apple’s M1 processor and the full 128-bit integer product

230 points by tgymnich, about 4 years ago

13 comments

titzer, about 4 years ago
Did anyone actually look at the machine code generated here? 0.30 ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle *on average*. 1 cycle *on average* for any kind of loop is just flat out suspicious.

This requires a lot more digging to understand.

Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
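To make the objection concrete, here is a minimal sketch of the kind of loop such a benchmark measures; this is not the article's actual code, and the constants and the mixing step are illustrative assumptions. The point is that iterations are independent, so an out-of-order core can overlap them and the *average* cost per value can fall well below the multiply's latency:

```c
#include <stddef.h>
#include <stdint.h>

/* Throughput-benchmark sketch (illustrative, not the article's code).
 * Each iteration computes a full 64x64 -> 128-bit product and folds both
 * halves into the output. Successive iterations do not depend on each
 * other's products (x advances by a cheap add), so the processor can
 * overlap many multiplies in flight. */
static void fill_products(uint64_t *out, size_t n, uint64_t seed) {
    uint64_t x = seed;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)x * 0x9E3779B97F4A7C15ull;
        out[i] = (uint64_t)p ^ (uint64_t)(p >> 64); /* uses high and low half */
        x += 0x2545F4914F6CDD1Dull; /* cheap add, not a dependent multiply */
    }
}
```

Compiling this with `clang -O3` and reading the emitted assembly, as the parent suggests, shows directly whether the loop was unrolled and how the low-half/high-half multiply pair was scheduled.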
p1mrx, about 4 years ago
RISC-V does this too: https://five-embeddev.com/riscv-isa-manual/latest/m.html

"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
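In portable C, this both-halves product is typically written through a 128-bit temporary, and the compiler emits the adjacent instruction pair on each architecture. A sketch, assuming compiler support for `unsigned __int128`:

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product via the compiler's 128-bit integer type.
 * On AArch64 this lowers to a mul (low half) plus umulh (high half); on
 * RV64 with the M extension it lowers to the mulhu/mul pair that the
 * manual's fusion recommendation is about. */
static void mul128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}
```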
namibj, about 4 years ago
For the interested, LLVM-MCA says this:

```
Iterations:        10000
Instructions:      100000
Total Cycles:      25011
Total uOps:        100000

Dispatch Width:    4
uOps Per Cycle:    4.00
IPC:               4.00
Block RThroughput: 2.5

No resource or data dependency bottlenecks discovered.
```

, which to me seems like 2.5 cycles per iteration (on Zen 3). Tiger Lake is a bit worse, at about 3 cycles per iteration, due to running more uOps per iteration, by the looks of it.

For the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):

```asm
.LBB5_2:                        # =>This Inner Loop Header: Depth=1
        mov     rdx, r11
        add     r11, r8
        mulx    rdx, rax, r9
        xor     rdx, rax
        mulx    rdx, rax, r10
        xor     rdx, rax
        mov     qword ptr [rdi + 8*rcx], rdx
        add     rcx, 2
        cmp     rcx, rsi
        jb      .LBB5_2
```
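Reading that listing back into C, each `mulx`/`xor` pair is a 64x64 -> 128-bit multiply whose high half is XORed with its low half, applied twice per stored value. The following reconstruction is a guess from the assembly; the function names, parameters, and exact indexing are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

/* One mulx/xor pair from the listing: full-width multiply, fold halves. */
static inline uint64_t mix(uint64_t x, uint64_t k) {
    unsigned __int128 p = (unsigned __int128)x * k;
    return (uint64_t)p ^ (uint64_t)(p >> 64);
}

/* Guessed shape of the inner loop: two chained mixes per output element,
 * with the input advancing by a fixed step (the `add r11, r8`). */
static void kernel(uint64_t *out, size_t n,
                   uint64_t start, uint64_t step,
                   uint64_t k1, uint64_t k2) {
    uint64_t x = start; /* r11 in the listing */
    for (size_t i = 0; i < n; i++) {
        out[i] = mix(mix(x, k1), k2); /* two mulx/xor pairs per store */
        x += step;
    }
}
```

The 2.5-cycle block throughput then corresponds to two full-width multiplies, two XORs, a store, and loop overhead per element, consistent with no single resource being the bottleneck.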
thesz, about 4 years ago
The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.

Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier: the former is a research project, while the latter was done at the dawn of computing.
tzs, about 4 years ago
I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?

How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it is something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh comes along before something overwrites that cache, it can use it.)
thefourthchime, about 4 years ago
I love my M1, but does anyone else have horrific performance when resuming from wake? It’s like it swaps everything to disk and takes a full minute to come back to life.
acje, about 4 years ago
So this is why the integer multiply-accumulate instruction, mullah, only delivers the most significant bits? Ironic if you aren't religious about these things.
yuhong, about 4 years ago
I believe that the ARMv8 NEON crypto extensions have a special instruction for a 64-bit multiply with a 128-bit product, which is useful for Monero mining, for example.
Daho0n, about 4 years ago
The number of bugs in the M1 and macOS posted on HN in a week could keep developers at Apple working for months.
mmaunder, about 4 years ago
A common misconception about RISC processors.
LAMike, about 4 years ago
Anyone want to take a guess at how long it will be until Apple has their own fab in the US making M1 chips?
ben_bai, about 4 years ago
That's great if your app is compute bound. "May all your processes be compute bound." Back in the real world, most of the time your process will be I/O bound. I think that's the real innovation of the M1 chip.
zelon88, about 4 years ago
You mean to tell me that a $2000 MacBook is almost as performant as a $1000 PC? Tell me more!