AMD Threadripper 3970X under heavy AVX2 load: Defective by design?

225 points by franzb, about 5 years ago

15 comments

bdd, about 5 years ago
“Unable to perform AVX2 instructions correctly under heavy load” is also a common “WTF Intel!?”-inducing phenomenon. I’m certain SREs who work at companies with more than 1 million servers have a bunch of hair-pulling stories.

Most (all?) Intel server CPUs in fact decrease clock speed when executing AVX2 (and some other) instructions to keep things a bit more sane. Vlad from Cloudflare wrote about this, more specifically for AVX-512, back in 2017: https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

Then there is the PROCHOT signal, which is supposed to protect the CPU from getting too hot but keeps getting raised under lopsided AVX2 loads, not because the CPU is too hot but because voltage regulation gets whacked.

You may wonder what an example of an AVX2-heavy load is. RSA multiplication is a good candidate. AES constructions or modes (CBC with SHA, GCM) are implemented in AVX2-BMI2 as well.
t0mas88, about 5 years ago
Most performance motherboards with unlocked Intel K models will downclock the maximum boost when using AVX instructions. The reason is very high power draw and temperatures. For example, my i5 9600K runs at 5 GHz turbo boost on all cores but 4.7 GHz when using AVX. If I disable that option, it crashes under prolonged load such as benchmarks.

Edit: To be clear, the i5 9600K is sold as 3.7 GHz with boost up to 4.6 GHz on a single core. So there is a difference from the AMD case in that this doesn't happen at the settings Intel sells it at.
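
The downclock described in this comment is commonly exposed in firmware as an "AVX offset": a number of multiplier steps subtracted from the core ratio while AVX code runs. Below is a minimal sketch of that arithmetic. The 100 MHz base clock and the offset of 3 are assumptions chosen to match the 5 GHz and 4.7 GHz figures above, not settings the poster reported, and `effective_clock_ghz` is a hypothetical helper name.

```python
def effective_clock_ghz(core_ratio: int, avx_offset: int, bclk_mhz: float = 100.0) -> float:
    """Return the core clock in GHz after subtracting an AVX ratio offset.

    bclk_mhz defaults to 100 MHz, which is an assumption for this sketch.
    """
    if avx_offset < 0 or avx_offset > core_ratio:
        raise ValueError("avx_offset must be between 0 and core_ratio")
    return (core_ratio - avx_offset) * bclk_mhz / 1000.0


# Ratio 50 with no offset: the non-AVX all-core clock from the comment.
print(effective_clock_ghz(50, 0))
# Ratio 50 with an assumed offset of 3: the clock under AVX load.
print(effective_clock_ghz(50, 3))
```

Disabling the option corresponds to an offset of 0, so AVX code then runs at the full ratio, which is the configuration the poster found unstable.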
ntauthority, about 5 years ago
My 3970X on the ASRock Taichi (with default settings, generally) does not seem to reproduce this issue at this time: the system remains operational despite the FMA3 path being used (I'm assuming this is behind the AVX2 flag? Disabling FMA3 leads to a plain AVX path) while running an all-core test with 16K FFTs in Prime95.

Either a slight background workload (Windows seems to be trying to use half a core for an OS update) resolves this, or this board does not have a broken power design?
Filligree, about 5 years ago
He's not alone; I've had similar problems with my 3960X.

It seems to be a power delivery issue, and fortunately fixable if you disable all spread-spectrum and VRM power-saving options, but the Zen series seems a tricky beast.

I've had machine crashes triggered by using the "wrong" CPU scheduler under Linux. It's amazing, in a horrible way.
nottorp, about 5 years ago
Hmm, are we having another "AMD motherboards are crap" moment?

Or is it simply that delivering 200+ W at load through a CPU socket can't be done reliably at consumer prices?

Has anyone had this problem with less high-end CPUs? Something at 65-95 W?
frou_dh, about 5 years ago
The original Ryzen (1000 series) shipped plenty of units with hardware defects that could be exposed by running parallel compiles. The so-called segfault bug:

https://www.phoronix.com/scan.php?page=article&item=new-ryzen-fixed&num=1
daneel_w, about 5 years ago
I can reproduce this problem on my non-Threadripper Ryzen 5 3600 Zen 2 CPU. I don't think it's specific to TR.

With AVX2 enabled, Prime95's torture test is only stable when I use 3 workers or fewer. With 4 workers, one of them will abort due to an error within 20 seconds. The more workers, the sooner a crash: with 5 workers it happens within 10 seconds, and with 6 workers it happens within 2-3 seconds.

If I run the tests on and off for a while, seemingly raising the quiescent temperature of the CPU, the whole experience and testing actually becomes a bit more stable. My motherboard uses the B450M chipset.
johnklos, about 5 years ago
Do some more investigating. Drop the memory clocks to stock, drop the CPU clocks to, perhaps, 3 GHz, and see if the same issues happen. If they do, there's a systemic issue that needs to be addressed. If the issue disappears, try raising the clock incrementally until the issue reappears. Get a Kill-a-watt and look at power usage for each frequency and graph the results.
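
The incremental procedure in this comment amounts to a simple ascending search that stops at the first failure. Here is a minimal sketch, assuming a hypothetical `is_stable` callback that stands in for a manual step (set the clock, run the stress test, report whether it passed). The clock values and the fake test are illustrative only, not measurements.

```python
from typing import Callable, Iterable, Optional


def highest_stable_clock(
    clocks_mhz: Iterable[int],
    is_stable: Callable[[int], bool],
) -> Optional[int]:
    """Try clocks in ascending order and return the last one that passed
    before the first failure, or None if even the lowest clock fails."""
    last_good = None
    for clock in sorted(clocks_mhz):
        if not is_stable(clock):
            break
        last_good = clock
    return last_good


# Stand-in for a real test run: pretend anything above 3800 MHz fails.
def fake_stress_test(clock_mhz: int) -> bool:
    return clock_mhz <= 3800


print(highest_stable_clock(range(3000, 4500, 100), fake_stress_test))
```

A result of `None` corresponds to the comment's first case, where the problem persists even at the reduced clock and points to a systemic issue. Power readings per step could be recorded alongside each call and plotted afterwards.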
cma, about 5 years ago
> Finally, a note on CPU temperatures: At idle the CPU hovers around 39-50 °C and tops around 72-78 °C under full load. I’m using the best air cooling setup I could think of and get my hands on, but it’s still air cooling, and my system is installed in a closed case (but with extreme attention to airflow).

I know air coolers can be competitive, but it says right on the outside of the 2950X box that you should use liquid cooling.
ComputerGuru, about 5 years ago
Can anyone suggest a CPU load-testing tool other than Prime95 that might catch things Prime95 wouldn't?

I have a machine running a 1950X and I get random ffmpeg segfaults anywhere from six to eight hours into an encoding session with all 16 cores fully loaded, but the machine is Prime95 stable for a week+, so I suspect it's an AVX/AVX2 issue.
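
Whatever tool is used, a stress test only catches silent computation errors if it verifies its results. The sketch below illustrates that self-checking principle only: it repeats a deterministic computation and reports any run that disagrees with the first. It is not a replacement for a real stress tool. It runs on a single thread, is not claimed to exercise AVX2 code paths, and the function names are hypothetical.

```python
import hashlib
from typing import List


def hash_chain(seed: bytes, iterations: int) -> str:
    """Hash a value repeatedly; the result depends only on the inputs."""
    digest = seed
    for _ in range(iterations):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def find_mismatches(rounds: int, iterations: int, seed: bytes = b"stress") -> List[int]:
    """Repeat the same computation and return the indices of rounds whose
    result differs from the first (reference) round."""
    reference = hash_chain(seed, iterations)
    return [
        r for r in range(1, rounds)
        if hash_chain(seed, iterations) != reference
    ]


# An empty list means every round reproduced the reference result.
print(find_mismatches(rounds=5, iterations=100_000))
```

On a healthy machine every round should match, so any non-empty result would suggest the hardware returned a wrong answer rather than the program containing a bug.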
ehutch79, about 5 years ago
So, does this mean Threadripper is unusable and we shouldn't buy them?
m0zg, about 5 years ago
Probably the motherboard, or VRM brownout to be more exact. That said, I'm glad I have not yet picked up a 3970X like I was planning to. AMD is pretty great with its warranty; motherboard manufacturers can be a chore.
citilife, about 5 years ago
From reading the comments, it appears this could just be a Prime95 bug.
maljx, about 5 years ago
I ran the same test on my Ryzen 3900X with no issues (MSI X570 ACE motherboard, Seasonic PSU).
grokas, about 5 years ago
Great to see L1T posted here.