If I'm understanding it correctly, they're not actually using the 512-bit (ZMM) registers, because using them can cause overall system slowdown. It seems to me they're only really useful if you're doing an AVX-512-intensive workload. And do those really exist? For something like bulk matrix multiplications, GPGPU is going to be much better, both in throughput and in operations per joule. I remain to be convinced that the ecological niche occupied by SIMD is significant, let alone expanding.
It will be a while before AVX-512 becomes practical, however. AMD doesn't support it (so any Ryzen or Threadripper fans will miss out), and even Intel's 8th-gen Coffee Lake doesn't support it.<p>Only the Intel Extreme i9 and Xeon Silver / Gold / Platinum parts seem to support it, so the market for this instruction set is quite limited.
<p><pre><code> document.querySelector('#k2Container').style.color = 'black';
</code></pre>
and the blog post becomes almost readable.<p>Other than that, nice intro.
I changed some Golang code to AVX in my last project. In isolation that code ran something like 2-4x faster, but as part of the full program, the program was 5% slower overall. I could never make sense of it. Any thoughts on how to determine the cause?
Doesn't mention what I find the coolest part of AVX-512: the conflict detection instructions. Finally a way to vectorize loops with indirect loads!
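To make the conflict detection point a bit more concrete, here is a minimal scalar sketch in Go of what, as I understand it, VPCONFLICTD computes per lane: lane i gets a bitmask of the earlier lanes that hold the same value. This is plain Go, not SIMD code, and `conflictMask` and the sample indices are names and data I made up for illustration.

```go
package main

import "fmt"

// conflictMask is a hypothetical scalar emulation of the per-lane result
// of a conflict detection instruction: bit j of out[i] is set when an
// earlier lane j (j < i) holds the same value as lane i.
// It assumes at most 32 lanes so the mask fits in a uint32.
func conflictMask(idx []uint32) []uint32 {
	out := make([]uint32, len(idx))
	for i := range idx {
		for j := 0; j < i; j++ {
			if idx[j] == idx[i] {
				out[i] |= 1 << uint(j)
			}
		}
	}
	return out
}

func main() {
	// Made-up index vector with repeats, e.g. histogram bucket numbers.
	idx := []uint32{3, 7, 3, 1, 7, 3, 0, 1}
	fmt.Println(conflictMask(idx)) // prints [0 0 1 0 2 5 0 8]
}
```

A loop like `table[idx[i]]++` is hard to vectorize because two lanes may target the same slot and one update gets lost in the scatter. With a mask like this, the lanes whose mask is zero can be updated together and the conflicting ones handled in a later pass.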