I find it hard to believe that it actually is a microcode issue.<p>Mostly because Intel has way too much motivation to pass it off as a microcode issue, as they can fix a microcode issue for free, by pushing out a patch. If it's an actual hardware issue, then Intel will be forced to actually recall all the faulty CPUs, which could cost them billions.<p>The other reason is that it took them way too long to give details. If it's as simple as buggy microcode requesting an out-of-spec voltage from the motherboard, they should have been able to diagnose the problem extremely quickly and fix it in just a few weeks. They would have detected the issue as soon as they put voltage logging on the motherboard's VRM. And according to some sources, Intel have apparently been shipping non-faulty CPUs for months now (since April, from memory), and those don't have an updated microcode.<p>This long delay and silence feels like they spent months of R&D trying to create a workaround: a new voltage spec that provides the lowest voltage possible. Low enough to work around a hardware fault on as many units as possible, without too large a performance regression or creating new errors on other CPUs because of undervolting.<p>I suspect that this microcode update will only "fix" the crashes for some CPUs. My prediction is that in another month Intel will claim there are actually two completely independent issues, and reluctantly issue a recall for anything not fixed by the microcode.
<a href="https://scholar.harvard.edu/files/mickens/files/theslowwinter.pdf" rel="nofollow">https://scholar.harvard.edu/files/mickens/files/theslowwinte...</a><p>"Unfortunately for John, the branches made a pact with Satan and quantum mechanics [...] In exchange for their last remaining bits of entropy, the branches cast evil spells on future generations of processors. Those evil spells had names like “scaling-induced voltage leaks” and “increasing levels of waste heat” [...] the branches, those vanquished foes from long ago, would have the last laugh."<p>"John was terrified by the collapse of the parallelism bubble, and he quickly discarded his plans for a 743-core processor that was dubbed The Hydra of Destiny and whose abstract Platonic ideal was briefly the third-best chess player in Gary, Indiana. Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas that might save his dreams of infinite scaling. He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs."
Remains to be seen how the microcode patch affects performance, and how these CPUs that have been affected by over-voltage to the point of instability will have aged in 6 months, or a few years from now.<p>More voltage generally improves stability, because there is more slack to close timing. Instability with high voltage suggests dangerous levels. A software patch can lower the voltage from this point on, but it can't take back any accumulated fatigue.
I think it's telling that they are delaying the microcode patch until <i>after</i> all the reviewers publish their Zen 5 reviews and the comparisons of those chips against current Raptor Lake performance.
Reminds me of Sudden Northwood Death Syndrome, 2002.<p>Looks like history may be repeating itself, or at least rhyming somewhat.<p>Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.<p>Now, with CPU manufacturers attempting to squeeze all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) ignorance that overlooked reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.<p>Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to add to the evidence that high voltage is what kills CPUs, not heat or frequency.<p>Edit: I just looked up the VCore maximum for the 13th/14th gen processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.
There was recently[1] some talk about how the 13th/14th gen mobile chips also had similar issues, though Intel insisted it's something else.<p>Will be interesting to see how that pans out.<p>[1]: <a href="https://news.ycombinator.com/item?id=41026123">https://news.ycombinator.com/item?id=41026123</a>
After watching <a href="https://youtube.com/watch?v=gTeubeCIwRw" rel="nofollow">https://youtube.com/watch?v=gTeubeCIwRw</a> and some related content, I personally don't believe it's an issue fixable with microcode. I guess we'll see.
Just want to say, I'm incredibly happy with my 7800X3D. It runs at ~70C max, like Intel chips used to, with a $35 air cooler, and it's on average the fastest chip for gaming workloads right now.
I was concerned this would happen to them, given how much power was being pushed through their chips to keep them competitive. I get the impression their innovation has either truly slowed down, or AMD thought enough 'moves' ahead with their tech/marketing/patents to paint them into a corner.<p>I don't think Intel is done though, at least not yet.
The amount of current their chips pull on full boost is pretty crazy. It would definitely not surprise me if some could get damaged by extensive boosting.
I built a system last fall with an i9-13900K and have been having the weirdest crashing problems with certain games that I never had problems with before. NEVER been able to track it down, no thermal issues, no overclocking, all updated drivers and BIOS. Maybe this is finally the answer I've been looking for.
Dumb question: let’s say I am in charge of procurement for a significant number of machines. Do I not have the option of ordering machines from three generations back? Are older (proven reliable) processors just not available because they’re no longer made, like my 1989 Camry?
Nice that Intel acknowledges there are problems with that CPU generation. If I read this right, the CPUs have been supplied with a too-high voltage across the board, with some tolerating the higher voltages for longer, others not so much.<p>Curious to see how this develops in terms of fixing defective silicon.
Good for Intel to finally "figure it out", but I'm not 100% sure microcode is 100% of the problem. As with everything complex enough, the "problem" can actually be many compounded problems; motherboard vendors' "special" tuning comes to mind.<p>But this is already a mess that will be very hard to clean up: I feel many of these CPUs will die in a year or two because of today's problems, but by then nobody will remember this, and an RMA will be "difficult", to say the least.
> Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages.<p>That’s great news for Intel, if that’s correct. If not, that’ll be a PR bloodbath.
Is there any info on how to diagnose this problem? Having just put together a computer with the 14900KF, I <i>really</i> don't want to swap it out if not necessary.
Hmm, mid August is after the new Ryzens are out, I wonder how bad of a performance hit this microcode update will bring?<p>And will it actually fix the issue?<p><a href="https://www.youtube.com/watch?v=QzHcrbT5D_Y" rel="nofollow">https://www.youtube.com/watch?v=QzHcrbT5D_Y</a>
(updated from other post about mobile crashes)<p>Related:<p>Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up<p><a href="https://news.ycombinator.com/item?id=40962736">https://news.ycombinator.com/item?id=40962736</a><p>Intel is selling defective 13-14th Gen CPUs<p><a href="https://news.ycombinator.com/item?id=40946644">https://news.ycombinator.com/item?id=40946644</a><p>Intel's woes with Core i9 CPUs crashing look worse than we thought<p><a href="https://news.ycombinator.com/item?id=40954500">https://news.ycombinator.com/item?id=40954500</a><p>Warframe devs report 80% of game crashes happen on Intel's Core i9 chips<p><a href="https://news.ycombinator.com/item?id=40961637">https://news.ycombinator.com/item?id=40961637</a>
If I hadn’t just recently invested in 128GB of DDR4, I’d jump ship to AMD/AM5. My 13900K has been (knock on wood) solid though, with 24/7 uptime since July 2023.
"Elevated operating voltage" my foot.<p>We've already seen examples of this happening on non-OC'd server-style motherboards that perfectly adhere to the Intel spec. This isn't like ASUS going 'hur dur 20% more voltage' and frying chips. If that's all it was, it would be obvious.<p>Lowering voltage may help mitigate the problem, but it sure as shit isn't the cause.
An Intel employee is posting on reddit: <a href="https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_13th14th_gen_desktop_processors/" rel="nofollow">https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_1...</a><p>A recent YouTube video by GamersNexus speculated the cause of instability might be a manufacturing issue. The employee's response follows.<p><i>Questions about manufacturing or Via Oxidation as reported by Tech outlets:</i><p><i>Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.</i><p><i>Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.</i><p><i>For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed</i>
Intel cannot afford to be anything but outstanding in terms of customer experience right now. They are getting assaulted on all fronts and need to do a lot to improve their image to stay competitive.
So on one hand they are saying it's voltage (i.e. something external, not their fault, bad mainboard manufacturers!).<p>On the other hand they are saying they will fix it in microcode.
How is that even possible?<p>Are they saying that their CPUs are signaling the mainboards to give them too much voltage?<p>Can someone make sense of this?
It reminds me of Steve Jobs' You Are Holding It Wrong moment.