Linus Torvalds on AVX512

140 points by ykm, almost 5 years ago

14 comments

robocat, almost 5 years ago

The AVX512 instructions can cause strange global performance downgrades.

“One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.” - 3JPLW and <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/" rel="nofollow">https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...</a>

The processor does not immediately downclock when encountering heavy AVX512 instructions: it first executes these instructions with reduced performance (say 4x slower), and only when there are many of them will the processor change its frequency. Light 512-bit instructions will move the core to a slightly lower clock.

* Downclocking is per core and lasts for a short time after you have used particular instructions (e.g., ~2ms).
* The downclocking of a core is based on the current license level of that core and on the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

As per <a href="https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/" rel="nofollow">https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...</a>
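A toy model can make the interleaving trade-off quoted above concrete. The sketch below assumes, pessimistically, that the whole workload runs at the reduced clock once the vector kernel is in use (in practice the core recovers after a short delay). Every number in it (3.0 GHz base clock, 2.4 GHz AVX-512 clock, 4x SIMD speedup) and every function name is a made-up illustration, not a measurement or an API of any real CPU.

```python
def scalar_only_time(scalar_cycles, kernel_cycles, base_ghz):
    """Run everything as scalar code at the base clock."""
    return (scalar_cycles + kernel_cycles) / (base_ghz * 1e9)


def vectorized_time(scalar_cycles, kernel_cycles, simd_speedup, avx_ghz):
    """Vectorize the kernel, but pay the reduced clock for ALL the code.

    Assumption: the scalar part is interleaved so finely with the vector
    part that the core never gets back to its base frequency.
    """
    return (scalar_cycles + kernel_cycles / simd_speedup) / (avx_ghz * 1e9)


def break_even_kernel_fraction(simd_speedup, base_ghz, avx_ghz):
    """Fraction p of cycles that must sit in the kernel for SIMD to pay off.

    From ((1 - p) + p / s) / f_avx < 1 / f_base it follows that
    p > (1 - f_avx / f_base) / (1 - 1 / s).
    """
    return (1 - avx_ghz / base_ghz) / (1 - 1 / simd_speedup)


if __name__ == "__main__":
    # Hypothetical clocks and speedup, chosen only for illustration.
    base, avx, speedup = 3.0, 2.4, 4.0
    print("break-even kernel fraction:",
          break_even_kernel_fraction(speedup, base, avx))
    for kernel_share in (0.10, 0.50):
        total = 1_000_000
        kernel = total * kernel_share
        scalar = total - kernel
        print(kernel_share,
              scalar_only_time(scalar, kernel, base),
              vectorized_time(scalar, kernel, speedup, avx))
```

Under these assumed numbers the kernel has to account for a bit over a quarter of the cycles before vectorizing wins; below that, the slower clock on the scalar code costs more than the vector kernel saves.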

abainbridge, almost 5 years ago

What are the forces in chip design that are at play here? Over the last 10-15 years, fabs have continued to fit more and more logic gates per unit area, but haven't reduced the power consumption per gate as much. As a result, if you fill your modern chip with compute gates, you cannot use them all at once because the chip will melt. Or at least you can't have them all running at max clock rates. One solution is to increase the proportion of the chip used for SRAM (it uses less power per unit area than compute gates); this is what Graphcore have done. Another is to put down multiple different compute blocks, each designed for a different purpose, and only use a few of them at a time. The big.LITTLE Arm designs in smartphones are an example of that. But I feel like AVX512 might be an example too. When they add ML accelerator blocks next, those also will not be usable flat out at the same time as the rest of the cores' resources.

I'm sure Intel should fix the problems Linus is complaining about, but I feel like chip vendors are being forced into this "add special purpose blocks" approach, as the only way to make their new chips better than their old ones.

floatboth, almost 5 years ago

I agree that there's too much focus on FP, but SIMD is not all about FP. Every new SIMD ISA extension has something interesting for integer.

Here's an article about JITing x86 to AVX-512 to fuzz 16 VMs per thread: <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html" rel="nofollow">https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...</a>

raverbashing, almost 5 years ago

FP matters (especially with SIMD)

It matters to image/video/audio processing

It matters to simulations

It matters to 3D models/rendering

It matters to games

So it's not "just benchmarks"; people actually want to do stuff with it.

Sure, AVX512 might not be the greatest way of doing it, and it might be better to just make the existing instructions go faster. That might work.

rbanffy, almost 5 years ago

I for one would be delighted by having more caches or wider backends instead of AVX512, but I don't want SIMD to be pushed into GPUs. It'd be better to do the reverse - to push forward the asymmetric core idea and move more GPU functionality into lots of simpler cores tuned for SIMD at the cost of single thread performance.

jasonzemos, almost 5 years ago

AVX-512's richness is to x86 what C++'s is to C. Linus makes a summary assessment of how he can leverage these technologies to his advantage, and if the cost of learning the technology and all its intricacies outweighs the perceived advantage, that technology is garbage. This reaction from Linus appears to fit his conservative pattern. I think where Linus gets things wrong stems from his facts rather than his philosophy.

AVX-512's fantastic breadth is born out of an actual need to free compilers from constraints imposed by programs in virtually every mainstream language. All of these describe programs for an academic machine rooted in a scalar instruction model. Without any further performance from increasing clock frequency over time, the target has to become instructions per cycle and even operations per instruction. The limitations on ILP and the expense of powering circuitry to achieve it have been well studied for the past two decades. The failure to realize it is evident in the failure of Netburst. Linus believes that the frontends of CPUs have a lot more to give, perhaps best exhibited by his refutation of CMOV (<a href="https://yarchive.net/comp/linux/cmov.html" rel="nofollow">https://yarchive.net/comp/linux/cmov.html</a>).

Today's programming languages haven't evolved to make it easier for programmers to describe non-scalar code. On the other hand, power constraints, and now security constraints, haven't made it easier for hardware to efficiently execute scalar code. Perhaps AVX-512 is as naive a bet as Itanium; if not, it might be just the missing piece compilers need that they didn't have twenty years ago.

RantyDave, almost 5 years ago

Are Intel just delaying the inevitable? Is it safe to say (even today) that a slow GPU will crunch big matrices faster than a fast CPU? And that's before we get to price/performance. So all that's left is the bottleneck around PCIe which, in theory, leaves the CPU with an advantage only for small datasets, which we don't really care about anyway (because they happen quickly).

Maybe the tradeoff is somewhere interesting from a latency perspective: SDR or similar. I dunno, am I barking up the wrong tree?

fancyfredbot, almost 5 years ago

AVX512 is both integer and floating point, not just FP, so this rant about FP comes across as ill-informed.

Despite that, I'd agree most people probably see no benefit from these units today. But that could change. For workloads with parallelism, wide SIMD is very efficient, more so than multiple threads anyway. The only way to get people to write vector code is to have vector processing available. Once it's ubiquitously available, people might code for it and the benefits may become more apparent.

throwaway_pdp09, almost 5 years ago

The very wide AVX stuff with integer ops, like these from wiki:

- AVX-512 Byte and Word Instructions (BW): extends AVX-512 to cover 8-bit and 16-bit integer operations
- AVX-512 Integer Fused Multiply Add (IFMA): fused multiply add of integers using 52-bit precision

could be very useful. I could have done with those recently. They also don't (AFAIK) cause CPU scaling (polite term for downclocking). He may well be right with FP though.
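For readers unfamiliar with IFMA, the "52-bit precision" wording can be unpacked with a small scalar emulation. The sketch below models one 64-bit lane of the two IFMA operations (the low-half and high-half multiply-adds, VPMADD52LUQ and VPMADD52HUQ) in plain Python; the function names are hypothetical helpers for illustration, and the real instructions apply the same operation to every 64-bit lane of the vector at once.

```python
MASK52 = (1 << 52) - 1  # only the low 52 bits of each operand are used
MASK64 = (1 << 64) - 1  # each lane is a 64-bit accumulator


def madd52_lo(acc, a, b):
    """One lane: acc + low 52 bits of the 104-bit product, wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product & MASK52)) & MASK64


def madd52_hi(acc, a, b):
    """One lane: acc + high 52 bits of the 104-bit product, wrapped to 64 bits."""
    product = (a & MASK52) * (b & MASK52)
    return (acc + (product >> 52)) & MASK64


def madd52_lo_vector(acc, a, b):
    """Apply the low-half operation lane by lane (8 lanes model a 512-bit vector)."""
    return [madd52_lo(x, y, z) for x, y, z in zip(acc, a, b)]


if __name__ == "__main__":
    a = (1 << 51) + 3
    b = (1 << 50) + 7
    lo = madd52_lo(0, a, b)
    hi = madd52_hi(0, a, b)
    # The two halves together reconstruct the full 104-bit product.
    print((hi << 52) + lo == a * b)
```

Keeping each limb at 52 bits leaves 12 spare bits in every 64-bit lane, so several partial products can be accumulated before carries have to be propagated, which is why this shape suits big-integer arithmetic.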

dang, almost 5 years ago

A related discussion is here: <a href="https://news.ycombinator.com/item?id=23822203" rel="nofollow">https://news.ycombinator.com/item?id=23822203</a>, also with interesting comments.

Since this thread is of the second freshness, we won't merge.

bartwe, almost 5 years ago

Down with SIMD, up with SPMD/compute

nullc, almost 5 years ago

There are 1001 AVX512 variations, but few equivalent operations to the RISC-V bit manipulation instructions.

gridlockd, almost 5 years ago

"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota.

Because absolutely nobody cares outside of benchmarks."

That was back in the stone age when a lot of applications for FP math weren't mainstream. Most of AVX-512 doesn't even concern FP; there's lots of integer and bit twiddling stuff there.

Furthermore, people really do care about these benchmarks. It influences their purchasing, which is really the thing that matters most to Intel. A lot of people don't actually care about hypothetical security issues or the fact that the CPU is 14nm when it still outperforms 7nm in single-threaded code.

Also, it's not like you can just trade off IPC or extra cores for wider SIMD. It's not like "just add more cores" is just as good for throughput, otherwise GPUs wouldn't exist. Wider SIMD is cheap in terms of die area, for the throughput it gives you.

Lastly, these are just instructions. Nothing says that an AVX-512 instruction needs to go through a physical 512-bit wide unit; it just says that you can take advantage of those semantics, if possible.
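The last point, that the instruction's semantics do not dictate the width of the hardware behind it, can be illustrated with a toy model. The sketch below treats a 512-bit vector as 16 lanes of 32 bits and shows that a lane-wise add gives the same result whether it is done in one 16-lane pass or as two passes through an 8-lane (256-bit) unit. All function names are hypothetical, and this models only the arithmetic, not how any particular CPU schedules the work.

```python
LANE_MASK = 0xFFFFFFFF  # each lane is an unsigned 32-bit integer


def add_256(a, b):
    """Model of a 256-bit execution unit: 8 lanes of wrapping 32-bit adds."""
    assert len(a) == len(b) == 8
    return [(x + y) & LANE_MASK for x, y in zip(a, b)]


def add_512_full(a, b):
    """Model of a native 512-bit unit: all 16 lanes in a single pass."""
    assert len(a) == len(b) == 16
    return [(x + y) & LANE_MASK for x, y in zip(a, b)]


def add_512_split(a, b):
    """Same 512-bit semantics, executed as two passes through the 256-bit unit."""
    assert len(a) == len(b) == 16
    low_half = add_256(a[:8], b[:8])
    high_half = add_256(a[8:], b[8:])
    return low_half + high_half


if __name__ == "__main__":
    a = list(range(16))
    b = [LANE_MASK] * 16  # adding 0xFFFFFFFF wraps around, i.e. subtracts 1
    print(add_512_full(a, b) == add_512_split(a, b))
```

Lane-wise operations split cleanly like this because no lane depends on another; instructions that move data across lanes (shuffles, permutes, reductions) would need more care in a split design.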

CamperBob2, almost 5 years ago

"Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks."

Today I learned that even Linus Torvalds has a bozo bit. [1] When's the last time he actually did anything with a computer?

[1]: <a href="https://en.wikipedia.org/wiki/Bozo_bit" rel="nofollow">https://en.wikipedia.org/wiki/Bozo_bit</a>
