Exploring the scalable matrix extension of the Apple M4 processor

184 points by gok 8 months ago

9 comments

rerdavies 8 months ago
In my experience, based on profiling and optimizing ML-based guitar amp models in the PiPedal project (https://rerdavies.github.io/pipedal/), when using only NEON instructions, performance is almost completely constrained by L2 memory bandwidth. Compute costs almost completely disappear while waiting for memory loads and stores.

So, although these devices have ferociously impressive FLOP rates, I'm extremely curious as to how the cost of memory loads and stores is going to work out.

I can very well imagine that having large local tile buffers is going to dramatically improve performance. But I'm curious how much. No matter how fast the compute is, it seems to me that performance of these sorts of devices in practice is going to be constrained by memory transfer rates, and perhaps by L1 caches in the tile compute unit that are better optimized for tile computation than the L1 cache on a general-purpose CPU.

My current expectation: that performance of matrix multiplies increases linearly with respect to tile size. I.e., a tile size of 8x8 floats will perform twice as fast as a matrix multiplier with a tile size of 4x4, since doubling the tile size reduces the required transfers to and from L2 by a factor of two.

So, compared to basic A72 ARM NEON (effectively a 4x8 tile size), I would expect about a 4x improvement by virtue of the larger tile size on the Apple tile processor, with both otherwise limited by the cost of L2 memory loads and stores. And maybe another 2x or 3x improvement because the tile processor's L1 caches (tile buffers) are tuned for tile multiply/accumulate operations.

Could somebody comment on how these devices actually perform on real matrix multiplies? It seems inconceivable to me that they will achieve peak FLOP rates in anything but meaningless test cases. And it seems a somewhat meaningless exercise to measure peak performance using test cases that are designed to completely eliminate L2 memory transfers.
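To make the scaling argument concrete, here is a minimal C sketch (illustrative only; not PiPedal or Apple code, and it assumes n is a multiple of the tile edge T):

```c
#include <stddef.h>

/* T x T tiled matmul with the accumulator tile held in registers (or,
 * by analogy, in SME's tile storage). Per k-step the inner loops
 * stream 2*T floats from L2 but perform T*T multiply-accumulates, so
 * L2 traffic per FLOP scales as 1/T: doubling the tile edge halves it,
 * which is the scaling argument above. */
#define T 8 /* with T = 4, L2 traffic per FLOP roughly doubles */

void matmul_tiled(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; i += T) {
        for (size_t j = 0; j < n; j += T) {
            float acc[T][T] = {{0}};   /* stays resident; never hits L2 */
            for (size_t k = 0; k < n; ++k)        /* 2*T loads per step */
                for (size_t ti = 0; ti < T; ++ti)
                    for (size_t tj = 0; tj < T; ++tj) /* T*T MACs */
                        acc[ti][tj] += A[(i + ti) * n + k]
                                     * B[k * n + (j + tj)];
            for (size_t ti = 0; ti < T; ++ti)
                for (size_t tj = 0; tj < T; ++tj)
                    C[(i + ti) * n + (j + tj)] = acc[ti][tj];
        }
    }
}
```

At T = 8 each k-step costs 16 loads for 64 MACs versus 8 loads for 16 MACs at T = 4, i.e. half the L2 traffic per FLOP, matching the expected ~2x.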
dividuum 8 months ago
> Although Apple has included a matrix accelerator in its devices since 2019, it used a proprietary instruction set inaccessible to developers, who officially could only use Apple-provided numerical libraries.

How does that work? Does the hardware throw some kind of fault when using those instructions? Or are they merely undocumented, and you could use them if you figured out how they work? I guess the second, as hinted by the "officially"?
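For SME itself, at least, there is a documented way to check availability before executing the instructions. A minimal sketch; the sysctl key follows Apple's hw.optional.arm.FEAT_* naming convention, but the exact key name is an assumption to verify against your OS version:

```c
#include <stdio.h>
#include <sys/sysctl.h>

int main(void) {
    int has_sme = 0;
    size_t len = sizeof(has_sme);
    /* Key name assumed from Apple's hw.optional.arm.FEAT_* convention;
     * verify against your SDK/OS. An absent key just means "no". */
    if (sysctlbyname("hw.optional.arm.FEAT_SME", &has_sme, &len, NULL, 0) != 0)
        has_sme = 0;
    printf("SME %savailable\n", has_sme ? "" : "not ");
    return 0;
}
```

Nothing comparable was published for the old AMX unit, which fits the "officially" reading above.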
freeqaz 8 months ago
Is there any comparison of how much faster this is than the previous way of doing things on the CPU?
nxobject 8 months ago
If Apple’s going for one SME accelerator per base M4 chiplet, it’ll be interesting to see how to program scalably for Pro/Max/Ultra variants.
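Within one accelerator, at least, the ACLE answer is vector-length-agnostic code: tile dimensions come from a runtime query rather than being hard-coded, so a variant with a wider streaming vector length would be picked up automatically. A hedged sketch (requires a compiler with SME support; spreading work across multiple SME units would additionally need explicit threading, which is not shown):

```c
#include <arm_sme.h>
#include <stddef.h>
#include <stdint.h>

/* svcntsw() reports the number of 32-bit elements per streaming
 * vector and is callable outside streaming mode; a 32-bit ZA tile
 * holds svcntsw() x svcntsw() elements. Deriving loop bounds from it,
 * instead of #defining a tile size, keeps one binary correct across
 * hypothetical vector-length differences between chip variants. */
void process_blocks(size_t rows, size_t cols) {
    uint64_t e = svcntsw();  /* queried at run time, not compile time */
    for (size_t i = 0; i < rows; i += e)
        for (size_t j = 0; j < cols; j += e) {
            /* ... load an e x e block and accumulate it into ZA ... */
        }
}
```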
kjkjadksj 8 months ago
I wish they made computers that ran software like games again. It seems like for the last few iterations they’ve been working hard on making computers that run AI models a little faster. Are people really asking for that? I would think far more people would like to play a video game than roll their own matrix multiplication, but I guess that’s why they pay the people at Apple the big bucks: they must know best.
ein0p 8 months ago
I’m not sure why they added this feature. All Apple SoCs have compute blocks far more energy-efficient than the CPU. This would only make sense for really tiny models that need an extremely quick forward pass, where the overhead of a GPU or Neural Engine kernel launch would be quite noticeable. But for those, the old NEON was already OK, and if not, there is also a dedicated matrix unit there called AMX. Seems kinda random to me.
brcmthrowaway 8 months ago
I'm dim, what's the difference between SVE and SME?
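Short version: SVE is 1-D, scalable-width vector processing; SME adds a streaming mode plus a 2-D ZA accumulator updated by outer-product instructions. A hedged ACLE sketch of SME's signature operation (attribute spellings follow recent ACLE revisions and vary across compiler versions):

```c
#include <arm_sme.h>

/* One FMOPA step: ZA tile 0 += outer product of two streaming vectors.
 * SVE's closest analogue, svmla_f32_m, is a 1-D elementwise
 * multiply-add; SME instead performs a whole rank-1 update of a 2-D
 * tile per instruction. Must be called from streaming mode with ZA
 * live, hence the function attributes. */
void rank1_update(svfloat32_t a_col, svfloat32_t b_row)
    __arm_streaming __arm_inout("za")
{
    svbool_t all = svptrue_b32();
    svmopa_za32_f32_m(/*tile=*/0, all, all, a_col, b_row);
}
```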
DanielLee5 8 months ago
Great review.
softwaredoug 8 months ago
I just wish they’d make native TensorFlow installation actually work without a million Apple-silicon-specific exceptions :)