
Apple’s M1 processor and the full 128-bit integer product

230 points, by tgymnich, about 4 years ago

13 comments

titzer, about 4 years ago
Did anyone actually look at the machine code generated here? 0.30 ns per value? That is basically 1 cycle. Of course, there is no way that a processor can compute so many dependent instructions in one cycle, simply because they generate so many dependent micro-ops, and every micro-op is at least one cycle to go through an execution unit. So this must mean that either the compiler is unrolling the (benchmarking) loop, or the processor is speculating many loop iterations into the future, so that the latencies can be overlapped and it works out to 1 cycle *on average*. 1 cycle *on average* for any kind of loop is just flat out suspicious.

This requires a lot more digging to understand.

Simply put, I don't accept the hastily arrived-at conclusion, and wish Daniel would put more effort into investigation in the future. This experiment is a poor example of how to investigate performance on small kernels. You should be looking at the assembly code output by the compiler at this point instead of spitballing.
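A minimal sketch of the kind of kernel under discussion, for anyone who wants to check the assembly themselves. The function names and the loop are illustrative assumptions, not the original benchmark; on AArch64 (the M1), a full 64×64→128-bit product written this way typically compiles to a MUL/UMULH pair, and inspecting the output of `clang -O3 -S kernel.c` shows whether the loop was unrolled.

    #include <stdint.h>
    #include <stddef.h>

    /* Full 64x64 -> 128-bit product via the compiler's 128-bit type
       (a GCC/Clang extension). Returns the low half, writes the high half. */
    static inline uint64_t mul_full(uint64_t a, uint64_t b, uint64_t *hi) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *hi = (uint64_t)(p >> 64);
        return (uint64_t)p;
    }

    /* Hypothetical benchmark-style loop: the compiler is free to unroll it
       and overlap iterations, which is exactly why the emitted assembly
       should be checked before trusting a ns/value figure. */
    void kernel(uint64_t *out, const uint64_t *in, size_t n, uint64_t k) {
        for (size_t i = 0; i < n; i++) {
            uint64_t hi;
            uint64_t lo = mul_full(in[i], k, &hi);
            out[i] = hi ^ lo;
        }
    }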
p1mrx, about 4 years ago
RISC-V does this too: https://five-embeddev.com/riscv-isa-manual/latest/m.html

"If both the high and low bits of the same product are required, then the recommended code sequence is [...]. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies."
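In C this pair is rarely written by hand. A sketch of the usual route, assuming GCC or Clang targeting RV64: a multiply through `unsigned __int128` is typically lowered to a mulhu/mul pair with matching source operands, which is the pattern the manual describes as fusible (the exact registers and instruction order are up to the compiler).

    #include <stdint.h>

    /* Both halves of a 64x64 product. On RV64, GCC/Clang typically lower
       this to a mulhu + mul pair reading the same source registers -- the
       sequence the RISC-V manual recommends so hardware can fuse it into
       a single multiply. */
    void mul_both(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }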
namibj, about 4 years ago
For the interested, LLVM-MCA says this:

    Iterations:        10000
    Instructions:      100000
    Total Cycles:      25011
    Total uOps:        100000
    Dispatch Width:    4
    uOps Per Cycle:    4.00
    IPC:               4.00
    Block RThroughput: 2.5
    No resource or data dependency bottlenecks discovered.

which to me looks like 2.5 cycles per iteration (on Zen 3). Tiger Lake is a bit worse, at about 3 cycles per iteration, by the looks of it because it runs more uOps per iteration.

This is for the following loop core (extracted from `clang -O3 -march=znver3`, using trunk (5a8d5a2859d9bb056083b343588a2d87622e76a2)):

    .LBB5_2:                        # =>This Inner Loop Header: Depth=1
        mov     rdx, r11
        add     r11, r8
        mulx    rdx, rax, r9
        xor     rdx, rax
        mulx    rdx, rax, r10
        xor     rdx, rax
        mov     qword ptr [rdi + 8*rcx], rdx
        add     rcx, 2
        cmp     rcx, rsi
        jb      .LBB5_2
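A rough reconstruction, inferred only from the assembly above rather than from the original source: each mulx/xor pair is a full 128-bit multiply whose high half is folded into the low half, and the loop applies two such rounds to a running value before storing the result. All names below are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* One multiply-fold round: full 128-bit product, then hi ^ lo.
       Each mulx/xor pair in the assembly above corresponds to one call. */
    static inline uint64_t mix(uint64_t x, uint64_t k) {
        unsigned __int128 p = (unsigned __int128)x * k;
        return (uint64_t)(p >> 64) ^ (uint64_t)p;
    }

    /* Hypothetical loop with the same shape as .LBB5_2: a running value
       advances by a fixed step (add r11, r8) and goes through two rounds
       with constants k1 (r9) and k2 (r10) before being stored. The store
       indexing in the real loop differs slightly (rcx advances by 2). */
    void fill(uint64_t *out, size_t n, uint64_t start, uint64_t step,
              uint64_t k1, uint64_t k2) {
        uint64_t x = start;
        for (size_t i = 0; i < n; i++) {
            out[i] = mix(mix(x, k1), k2);
            x += step;
        }
    }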
thesz, about 4 years ago
The multiplier in the M1 can be pipelined or replicated (or both), so issuing two instructions can be as fast as issuing one.

Instruction-recognition logic (DAG analysis, BTW) is harder to implement than a pipelined multiplier. The former is a research project, while the latter was done at the dawn of computing.
tzs, about 4 years ago
I wonder if order matters? That is, would mul followed by mulh be the same speed as mulh followed by mul?

How about if there is an instruction between them that does not do arithmetic? (What I'm wondering here is whether the processor recognizes the specific two-instruction sequence, or whether it is something more general, like mul internally producing the full 128 bits, returning the lower 64, and caching the upper 64 bits somewhere, so that if a mulh comes along before something overwrites that cache, it can use it.)
thefourthchime, about 4 years ago
I love my M1, but does anyone else have horrific performance when resuming from sleep? It’s like it swaps everything to disk and takes a full minute to come back to life.
acje, about 4 years ago
So this is why the integer multiply-accumulate instruction "mullah" only delivers the most significant bits? Ironic if you aren't religious about these things.
yuhong, about 4 years ago
I believe that the ARMv8 NEON crypto extensions have a special instruction for a 64-bit multiply with a 128-bit product, which is useful for Monero mining, for example.
Daho0n, about 4 years ago
The number of M1 and macOS bugs posted on HN in a single week could keep developers at Apple working for months.
mmaunder, about 4 years ago
A common misconception about RISC processors.
LAMike, about 4 years ago
Anyone want to take a guess at how long it will be until Apple has their own fab in the US making M1 chips?
ben_bai, about 4 years ago
That's great if your app is compute bound. "May all your processes be compute bound." Back in the real world, most of the time your process will be I/O bound. I think that's the real innovation of the M1 chip.
zelon88, about 4 years ago
You mean to tell me that a $2000 MacBook is almost as performant as a $1000 PC? Tell me more!