In my experience, based on profiling and optimizing ML-based guitar amp models in the PiPedal project (<a href="https://rerdavies.github.io/pipedal/" rel="nofollow">https://rerdavies.github.io/pipedal/</a>), when using only NEON instructions, performance is almost completely constrained by L2 memory bandwidth. Compute costs almost completely disappear while waiting for memory loads and stores.<p>So, although these devices have ferociously impressive FLOP rates, I'm extremely curious as to how the cost of memory loads and stores is going to work out.<p>I can very well imagine that having large local tile buffers is going to dramatically improve performance. But I'm curious how much. No matter how fast the compute speed is, it seems to me that performance of these sorts of devices in practice is going to be constrained by memory transfer rates, and perhaps by how well the L1 caches in the tile compute unit are optimized for tile computation compared with the L1 cache on a general-purpose CPU.<p>My current expectation: that performance of matrix multiplies increases linearly with respect to tile size, i.e. a matrix multiplier with a tile size of 8x8 floats will perform twice as fast as one with a tile size of 4x4, since doubling the tile size reduces the required transfers to and from L2 by a factor of two.<p>So, compared to a basic A72 ARM NEON (effectively, 4x8 tile size), I would expect about a 4x improvement by virtue of the fact that the tile size is larger on the Apple tile processor, with both otherwise limited entirely by the cost of L2 memory loads and stores. And maybe another 2x or 3x improvement because the tile processor L1 caches (tile buffers) are tuned for tile multiply/accumulate operations.<p>Could somebody comment on how these devices actually perform on real matrix multiplies? It seems inconceivable to me that these devices will actually achieve peak FLOP rates in anything but meaningless test cases. 
It also seems a somewhat meaningless exercise to measure peak performance using test cases that are designed to completely eliminate L2 memory transfers.
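<p>The "twice the tile size, half the L2 traffic" expectation above can be checked with a back-of-the-envelope count. This is only a sketch under a deliberately simple model, not a measurement: it assumes square n x n matrices, square tiles, output tiles that stay in local registers or tile buffers until complete, and no reuse of A or B data between output tiles. The function names are made up for illustration.

```python
def l2_loads(n: int, tile: int) -> int:
    """Floats loaded from L2 for an (n x n) @ (n x n) multiply.

    Assumed model (hypothetical, for illustration only): each
    tile x tile block of the output is accumulated locally, and
    computing it streams `tile` rows of A and `tile` columns of B,
    each n floats long, with no reuse across output blocks.
    """
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile")
    blocks = (n // tile) ** 2          # number of output tiles
    loads_per_block = 2 * tile * n     # tile rows of A + tile columns of B
    return blocks * loads_per_block


def loads_per_mac(n: int, tile: int) -> float:
    """L2 loads per multiply-accumulate; equals 2 / tile in this model."""
    return l2_loads(n, tile) / n**3


if __name__ == "__main__":
    n = 1024
    for tile in (4, 8, 16, 32):
        print(f"tile {tile:2d}x{tile:<2d}  L2 loads: {l2_loads(n, tile):>11d}  "
              f"loads per MAC: {loads_per_mac(n, tile):.4f}")
```

<p>In this model, total loads come out to 2n³/tile, so each doubling of the tile edge halves L2 traffic. If the multiply really is bandwidth-bound, that would translate into roughly a 2x speedup per doubling. Real hardware adds prefetching, cache blocking, and store traffic that this ignores.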