“Unable to perform AVX2 instructions correctly under heavy load” is also a common “WTF Intel!?”–inducing phenomenon. I’m certain SREs who work at companies with more than 1 million servers have a bunch of hair-pulling stories.<p>Most (all?) Intel server CPUs in fact decrease clock speed when executing AVX2 (and some other) instructions to keep things a bit more sane. Vlad from Cloudflare wrote about this, more specifically for AVX-512, back in 2017: <a href="https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/" rel="nofollow">https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...</a><p>Then there is the PROCHOT signal, which is supposed to protect the CPU from getting too hot but keeps getting raised under lopsided AVX2 loads, not because the CPU is too hot but because voltage regulation gets whacked.<p>You may wonder what an example of an AVX2-heavy load is. RSA multiplication is a good candidate. AES constructions or modes (CBC with SHA, GCM) are implemented in AVX2-BMI2 as well.
Most performance motherboards with Intel unlocked K models will downclock the maximum boost when using AVX instructions. The reason is very high power draw and temperatures. For example, my i5 9600K runs at 5 GHz turbo boost on all cores but 4.7 GHz when using AVX. If I disable that option, it crashes under prolonged load such as benchmarks.<p>Edit: To be clear, the i5 9600K is sold as 3.7 GHz with boost up to 4.6 GHz on a single core. So there is a difference from the AMD case in that this doesn't happen at the settings Intel sells it at.
My 3970X on the ASRock Taichi (with default settings, generally) does not seem to reproduce this issue at this time - the system remains operational despite the FMA3 path being used (I'm assuming this is behind the AVX2 flag? disabling FMA3 leads to a plain AVX path) while running an all-core test with 16K FFTs in Prime95.<p>Either a slight background workload (Windows seems to be trying to use half a core for an OS update) resolves this, or this board does not have a broken power design?
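On the "is FMA3 behind the AVX2 flag?" question: as far as CPUID goes, FMA3 and AVX2 are separate feature bits, and on Linux they show up as separate `fma` and `avx2` entries on the `flags` line of `/proc/cpuinfo`. A minimal sketch for checking a Linux box follows; the function names are made up for illustration, and it says nothing about how Prime95 itself picks its code path.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. not an x86 Linux system)


def simd_support(cpuinfo_text):
    """Map each SIMD-related flag of interest to whether it is present."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("avx", "avx2", "fma", "bmi2")}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(simd_support(f.read()))
    except OSError:
        print("no readable /proc/cpuinfo here (not Linux?)")
```

If `fma` and `avx2` both come back true, that only tells you the CPU advertises both; which one a given benchmark setting toggles is up to the benchmark.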
He's not alone; I've had similar problems with my 3960X.<p>It seems to be a power delivery issue, and fortunately fixable if you disable all spread-spectrum and VRM power-saving options, but the Zen series seems a tricky beast.<p>I've had machine crashes triggered by using the "wrong" CPU scheduler under Linux. It's amazing, in a horrible way.
Hmm, are we having another "AMD motherboards are crap" moment?<p>Or is it simply that delivering 200+ W at load through a CPU socket can't be reliably done at consumer prices?<p>Has anyone had this problem with lower-end CPUs? Something at 65-95 W?
The original Ryzen (1000 series) shipped plenty of units with hardware defects that could be exposed by running parallel compiles. The so-called segfault bug:<p><a href="https://www.phoronix.com/scan.php?page=article&item=new-ryzen-fixed&num=1" rel="nofollow">https://www.phoronix.com/scan.php?page=article&item=new-ryze...</a>
I can reproduce this problem on my non-Threadripper Ryzen 5 3600 Zen 2 CPU. I don't think it's specific to TR.<p>With AVX2 enabled, Prime95's torture test is only stable when I use 3 workers or fewer. With 4 workers, one of them will abort due to an error within 20 seconds. The more workers, the sooner a crash; with 5 workers it happens within 10 seconds, and with 6 workers it happens within 2-3 seconds.<p>If I start and stop the tests repeatedly for a while, seemingly raising the quiescent temperature of the CPU, the whole thing actually becomes a bit more stable. My motherboard uses the B450M chipset.
Do some more investigating. Drop the memory clocks to stock, drop the CPU clocks to, perhaps, 3 GHz, and see if the same issues happen. If they do, there's a systemic issue that needs to be addressed. If the issue disappears, try raising the clock incrementally until the issue reappears. Get a Kill-a-watt and look at power usage for each frequency and graph the results.
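To make the "raise the clock until it breaks, then graph it" step concrete, here is a rough sketch for logging each run and finding where stability ends. The function names are invented and the numbers in the example are placeholders, not measurements from any real system; plug in your own Kill-a-watt readings.

```python
def find_stability_edge(runs):
    """runs: non-empty list of (freq_mhz, watts, passed) tuples, one per run.

    Returns (highest_passing_freq, lowest_failing_freq); either may be None.
    """
    passing = [freq for freq, _, ok in runs if ok]
    failing = [freq for freq, _, ok in runs if not ok]
    return (max(passing) if passing else None,
            min(failing) if failing else None)


def text_graph(runs, width=40):
    """Crude bar chart of wall power per frequency, marking failed runs."""
    peak = max(watts for _, watts, _ in runs)
    lines = []
    for freq, watts, ok in sorted(runs):
        bar = "#" * max(1, round(width * watts / peak))
        status = "" if ok else "  FAILED"
        lines.append(f"{freq:>5} MHz {bar} {watts:.0f} W{status}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = [(3000, 120.0, True), (3200, 135.0, True), (3400, 150.0, False)]
    print(find_stability_edge(example))  # (3200, 3400)
    print(text_graph(example))
```

If the lowest failing frequency comes out below the highest passing one, the failures are probably not a simple function of clock speed, which is itself useful to know.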
> Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C and tops around 72-78 °C under full load. I’m using the best air cooling setup I could think of and get my hands on, but it’s still air cooling, and my system is installed in a closed case (but with extreme attention to airflow).<p>I know air coolers can be competitive, but it says right on the outside of the 2950X box that you should use liquid cooling.
Can anyone suggest a CPU load-testing tool other than Prime95 that might catch things Prime95 wouldn't?<p>I have a machine running a 1950X and I get random ffmpeg segfaults anywhere from six to eight hours into an encoding session with all 16 cores fully loaded, but the machine is Prime95 stable for a week+, so I suspect it's an AVX/AVX2 issue.
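Not a replacement for a real tool, but the basic idea behind this kind of testing (do deterministic work, then check the answer) is easy to sketch yourself. The version below computes each result twice and counts disagreements across several processes. All names are made up, and a big assumption applies: plain Python with `hashlib` is not guaranteed to exercise the AVX/AVX2 units at all, so treat it as an illustration of the compare-results approach rather than an AVX stress test.

```python
import hashlib
from multiprocessing import Pool


def hash_chain(seed: int, length: int) -> bytes:
    """Feed SHA-256 into itself `length` times, starting from `seed`."""
    digest = seed.to_bytes(8, "little")
    for _ in range(length):
        digest = hashlib.sha256(digest).digest()
    return digest


def check_rounds(seed: int, rounds: int, length: int = 10_000) -> int:
    """Compute each chain twice and count rounds where the results differ."""
    mismatches = 0
    for r in range(rounds):
        first = hash_chain(seed + r, length)
        second = hash_chain(seed + r, length)
        if first != second:
            mismatches += 1
    return mismatches


def run_parallel(workers: int, rounds: int) -> int:
    """Run check_rounds in `workers` processes; return total mismatches."""
    jobs = [(w * rounds, rounds) for w in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.starmap(check_rounds, jobs))
```

Call `run_parallel(16, some_large_round_count)` from under an `if __name__ == "__main__":` guard and leave it running. On healthy hardware the total should be zero; anything else means two identical computations disagreed, and a worker that dies outright is a data point too.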
Probably the motherboard, or VRM brownout to be more exact. That said, I'm glad I haven't yet picked up a 3970X like I was planning to. AMD is pretty great with its warranty, but motherboard manufacturers can be a chore.