July 2024 Update on Instability Reports on Intel Core 13th/14th Gen Desktop CPUs

327 points by acrispino 10 months ago

28 comments

phire 10 months ago

I find it hard to believe that it actually is a microcode issue.

Mostly because Intel has way too much motivation to pass it off as a microcode issue, as they can fix a microcode issue for free by pushing out a patch. If it's an actual hardware issue, then Intel will be forced to actually recall all the faulty CPUs, which could cost them billions.

The other reason is that it took them way too long to give details. If it's as simple as buggy microcode requesting an out-of-spec voltage from the motherboard, they should have been able to diagnose the problem extremely quickly and fix it in just a few weeks. They would have detected the issue as soon as they put voltage logging on the motherboard's VRM. And according to some sources, Intel have apparently been shipping non-faulty CPUs for months now (since April, from memory), and those don't have an updated microcode.

This long delay and silence feels like they spent months of R&D trying to create a workaround: a new voltage spec that provides the lowest voltage possible. Low enough to work around a hardware fault on as many units as possible, without too large a performance regression, and without creating new errors on other CPUs because of undervolting.

I suspect that this microcode update will only "fix" the crashes for some CPUs. My prediction is that in another month Intel will claim there are actually two completely independent issues, and reluctantly issue a recall for anything not fixed by the microcode.

HeliumHydride 10 months ago

https://scholar.harvard.edu/files/mickens/files/theslowwinter.pdf

"Unfortunately for John, the branches made a pact with Satan and quantum mechanics [...] In exchange for their last remaining bits of entropy, the branches cast evil spells on future generations of processors. Those evil spells had names like “scaling-induced voltage leaks” and “increasing levels of waste heat” [...] the branches, those vanquished foes from long ago, would have the last laugh."

"John was terrified by the collapse of the parallelism bubble, and he quickly discarded his plans for a 743-core processor that was dubbed The Hydra of Destiny and whose abstract Platonic ideal was briefly the third-best chess player in Gary, Indiana. Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas that might save his dreams of infinite scaling. He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs."

tux3 10 months ago

Remains to be seen how the microcode patch affects performance, and how these CPUs that have been affected by over-voltage to the point of instability will have aged in 6 months, or a few years from now.

More voltage generally improves stability, because there is more slack to close timing. Instability with high voltage suggests dangerous levels. A software patch can lower the voltage from this point on, but it can't take back any accumulated fatigue.

tpurves 10 months ago

I think it's telling that they are delaying the microcode patch until after all the reviewers publish their Zen 5 reviews and the comparisons of those chips against current Raptor Lake performance.

userbinator 10 months ago

Reminds me of Sudden Northwood Death Syndrome, 2002.

Looks like history may be repeating itself, or at least rhyming somewhat.

Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent; thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.

Now, with CPU manufacturers attempting to squeeze out all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) ignorance that overlooked reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.

Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to contribute to the evidence that high voltage is what kills CPUs, not heat or frequency.

Edit: I just looked up the VCore maximum for the 13th/14th gen processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.

magicalhippo 10 months ago

There was recently[1] some talk about how the 13th/14th gen mobile chips also had similar issues, though Intel insisted it's something else.

Will be interesting to see how that pans out.

[1]: https://news.ycombinator.com/item?id=41026123

TazeTSchnitzel 10 months ago

After watching https://youtube.com/watch?v=gTeubeCIwRw and some related content, I personally don't believe it's an issue fixable with microcode. I guess we'll see.

wnevets 10 months ago

Are the CPUs that received elevated operating voltage permanently damaged?

Covzire 10 months ago

Just want to say, I'm incredibly happy with my 7800X3D. It runs at ~70C max, like Intel chips used to, with a $35 air cooler, and it's on average the fastest chip for gaming workloads right now.

NBJack 10 months ago

I was concerned this would happen to them, given how much power was being pushed through their chips to keep them competitive. I get the impression their innovation has either truly slowed down, or AMD thought enough 'moves' ahead with their tech/marketing/patents to paint them into a corner.

I don't think Intel is done though, at least not yet.

brynet 10 months ago

Curious why Intel announced this on their community forums, rather than somewhere more official.

christkv 10 months ago

The amount of current their chips pull on full boost is pretty crazy. It would definitely not surprise me if some could get damaged by extensive boosting.

cdchn 10 months ago

I built a system last fall with an i9-13900K and have been having the weirdest crashing problems with certain games that I never had problems with before. NEVER been able to track it down, no thermal issues, no overclocking, all updated drivers and BIOS. Maybe this is finally the answer I've been looking for.

uticus 10 months ago

Dumb question: let’s say I am in charge of procurement for a significant amount of machines, do I not have the option of ordering machines from three generations back? Are older (proven reliable) processors just not available because they’re no longer made, like my 1989 Camry?

firebaze 10 months ago

Nice that Intel acknowledges there are problems with that CPU generation. If I read this right, the CPUs have been supplied with a too-high voltage across the board, with some tolerating the higher voltages for longer, others not so much.

Curious to see how this develops in terms of fixing defective silicon.

nubinetwork 10 months ago

They already tried BIOS updates when they pushed out the "Intel defaults" a couple of months ago...

PedroBatista 10 months ago

Good for Intel to finally "figure it out", but I'm not 100% sure microcode is 100% of the problem. As with everything complex enough, the "problem" can actually be many compounded problems; MB vendors' "special" tuning comes to mind.

But this is already a mess that is very hard to clean up, since I feel many of these CPUs will die in a year or two because of these problems today, but by then nobody will remember this and an RMA will be "difficult", to say the least.

Havoc 10 months ago

> Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages.

That’s great news for Intel, if that’s correct. If not, that’ll be a PR bloodbath.

salamo 10 months ago

Is there any info on how to diagnose this problem? Having just put together a computer with the 14900KF, I really don't want to swap it out if not necessary.

ChoGGi 10 months ago

Hmm, mid August is after the new Ryzens are out. I wonder how bad of a performance hit this microcode update will bring?

And will it actually fix the issue?

https://www.youtube.com/watch?v=QzHcrbT5D_Y

ChrisArchitect 10 months ago

(updated from other post about mobile crashes)

Related:

Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up
https://news.ycombinator.com/item?id=40962736

Intel is selling defective 13-14th Gen CPUs
https://news.ycombinator.com/item?id=40946644

Intel's woes with Core i9 CPUs crashing look worse than we thought
https://news.ycombinator.com/item?id=40954500

Warframe devs report 80% of game crashes happen on Intel's Core i9 chips
https://news.ycombinator.com/item?id=40961637

whalesalad 10 months ago

If I hadn’t just recently invested in 128GB of DDR4, I’d jump ship to AMD/AM5. My 13900K has been (knock on wood) solid though, with 24/7 uptime since July 2023.

eigenform 10 months ago

by "microcode" i assume they meant "pcode" for the PCU? (but they decided not to make that distinction here for whatever reason?)

Night_Thastus 10 months ago

"Elevated operating voltage" my foot.

We've already seen examples of this happening on non-OC'd server-style motherboards that perfectly adhere to the Intel spec. This isn't like ASUS going 'hur dur 20% more voltage' and frying chips. If that's all it was it would be obvious.

Lowering voltage may help mitigate the problem, but it sure as shit isn't the cause.

acrispino 10 months ago

An Intel employee is posting on reddit: https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_13th14th_gen_desktop_processors/

A recent YouTube video by GamersNexus speculated the cause of instability might be a manufacturing issue. The employee's response follows.

> Questions about manufacturing or Via Oxidation as reported by Tech outlets:

> Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.

> Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.

> For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed.

loufe 10 months ago

Intel cannot afford to be anything but outstanding in terms of customer experience right now. They are getting assaulted on all fronts and need to do a lot to improve their image to stay competitive.

xyst 10 months ago

Wonder what Linus has to say on this. Dude knows how to rip into crappy Intel products

fefe23 10 months ago

So on one hand they are saying it's voltage (i.e. something external, not their fault, bad mainboard manufacturers!).

On the other hand they are saying they will fix it in microcode. How is that even possible?

Are they saying that their CPUs are signaling the mainboards to give them too much voltage?

Can someone make sense of this? It reminds me of Steve Jobs' You Are Holding It Wrong moment.
