Author here if anyone has Pentium questions :-)<p>My Mastodon thread about the bug was on HN a few weeks ago, so this might seem familiar, but now I've finished a detailed blog post. The previous HN post has a bunch of comments: <a href="https://news.ycombinator.com/item?id=42391079">https://news.ycombinator.com/item?id=42391079</a>
The bug is super fun, but I also find the Intel response to be fascinating on its own. They apparently didn’t offer a non-faulty replacement processor to everyone who wanted one, resulting in a ton of bad press.<p>To contrast, I’ve been thinking a lot about the Amazon Colorsoft launch, which had a yellow band graphics issue on some devices (mine included). Amazon waited a bit before acknowledging it (maybe a day or two, presumably to get the facts right). Then they simply, quietly replaced all of them. No recall. They just send you a new one if you ask for it (my replacement comes Friday; hopefully it will fix it). My takeaway is that having an incredibly robust return/support apparatus has a lot of benefits when launches don’t go quite right. Certainly more than you’d expect from analysis.<p>Similarly, I haven’t seen too many recent reports about the Apple AirPods Pro crackle issue that happened a couple of years ago (my AirPods had to be replaced twice), but Apple also just quietly replaced them, and the support competence really seemed like something powerful that isn’t always noticed.<p>Colorsoft: <a href="https://www.tomsguide.com/tablets/e-readers/amazon-kindle-colorsoft-yellow-stripe-defect-now-has-a-culprit" rel="nofollow">https://www.tomsguide.com/tablets/e-readers/amazon-kindle-co...</a><p>AirPods Pro: <a href="https://support.apple.com/airpods-pro-service-program-sound-issues" rel="nofollow">https://support.apple.com/airpods-pro-service-program-sound-...</a>
> Intel's whitepaper claimed that a typical user would encounter a problem once every 27,000 years, insignificant compared to other sources of error such as DRAM bit flips.<p>> However, IBM performed their own analysis, suggesting that the problem could hit customers every few days.<p>I bet these aren’t as far off as they seem. Intel seems to be considering a single user, while I suspect IBM is thinking in terms of support calls.<p>This is a problem I’ve had at work. When you process 100 million requests a day, the one-in-a-billion problem hits you a few times a month. If it’s something a customer or, worse, a manager notices, they ignore the denominator and suspect you all of incompetence. Four times a month can translate into “all the time” given how humans bias their experiences. If you get two statistical clusters of three in a week someone will lose their shit.
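The arithmetic in that comment can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names and the 30-day month are assumptions, and the Poisson step assumes failures are independent of each other.

```python
import math

def expected_hits(requests_per_day, one_in, days=30):
    """Expected number of failures over `days` at a 1-in-`one_in` failure rate."""
    return requests_per_day * days / one_in

def prob_at_least_one(expected):
    """Poisson probability of at least one failure (assumes independent failures)."""
    return 1.0 - math.exp(-expected)

# 100 million requests a day, one-in-a-billion failure rate (figures from the comment).
monthly = expected_hits(100_000_000, 1_000_000_000)          # 3.0 per 30-day month
weekly = expected_hits(100_000_000, 1_000_000_000, days=7)   # 0.7 per week
print(monthly, round(prob_at_least_one(weekly), 3))          # 3.0 0.503
```

So even at one in a billion, roughly every other week contains at least one failure at that request volume.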
> It appears that only one person (Professor Nicely) noticed the bug in actual use.<p>I recall a study done years ago where students were supplied calculators for their math class. The calculators had been doctored to produce incorrect results. The researchers wanted to know how wrong the calculators had to be before the students noticed something was amiss.<p>It was a factor of 2.<p>Noticing the error, and being affected by the error, are two entirely different things.<p>I.e. how many people check to see if the computer's output is correct? I'd say very, very, very few. Not me, either, except in one case - when I was doing engineering computations at Boeing, I'd run the equations backwards to verify the outputs matched the inputs.
I remember that bug. Because I could not control what CPU my customers were running on, I had to add special code in the library to detect the bad FPU and execute workaround code (this code was supplied by Intel).<p>I.e. Intel's problem became my problem, grrrr
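For readers who haven't seen this kind of check, here is a minimal Python sketch of the widely circulated division test. It is not the workaround code Intel supplied, which isn't reproduced in this thread; the function name is hypothetical, and the two constants are the well-known operand pair reported to trigger the bug (a residual of 256 on a flawed Pentium, 0 on a correct FPU).

```python
def has_fdiv_bug():
    """Return True if the classic FDIV test division comes out wrong.

    Hypothetical helper for illustration only; a real library would run the
    equivalent check in native code once at startup.
    """
    x = 4195835.0
    y = 3145727.0
    # With a correct IEEE 754 double-precision divide, (x / y) * y rounds
    # back to x, so the residual is 0.0. A flawed divide leaves a large residual.
    residual = x - (x / y) * y
    return abs(residual) > 1.0

if has_fdiv_bug():
    print("flawed FPU detected: use the software workaround path")
else:
    print("FPU divides correctly")
```

A library would typically cache the result and branch to the slower workaround path only when the check fails.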
Reminds me of a joke floating around at the time that captures a couple different 90s themes:<p><pre><code> I AM PENTIUM OF BORG.
DIVISION IS FUTILE.
YOU WILL BE APPROXIMATED.</code></pre>
Another great article from Ken. I remember this particularly because the first PC that I bought with my own money had an affected CPU. Prior to this era I hadn't been much interested in PCs because they couldn't run "real" software. But Windows NT changed that (thank you Mr. Cutler), and low-cost Taiwanese-sourced motherboards made it practical to build your own machine, as many people still do today. Ken touched on the fact that it was easy for users to check if their CPU was affected. I remember that this was as easy as typing a division expression with the magic numbers into Excel. If MS had released a version of Excel that worked around the bug, I suspect fewer users would have claimed their replacement device!
What an interesting and utterly dedicated analysis. Thank you so much for all your work analysing the silicon and sharing your findings. I particularly like how you’re able to call out Intel on the actual root cause, which their PR made sound like something analogous to a trivial omission. In fact, it was less forgivable and more blameworthy, i.e. they stuffed up their table generation algorithm.
Given that the fixed table is a much simpler one (by letting out-of-bounds just return 2, rather than adding circuitry to make it return 0), I wonder why they didn't just do it that way in the first place?
> Since only one in 9 billion values caused the problem, Intel's view was that the problem was trivial: "This doesn't even qualify as an errata."<p>This sounds utterly insane. You are making a CPU; if any calculations are wrong, it needs to be fixed.
I suppose this only came to light very late in testing, and it was impractical to bin every CPU, so they rolled the dice.
> Curiously, the adder is an 8-bit adder but only 7 bits are used; perhaps the 8-bit adder was a standard logic block at Intel.<p>I believe this is because for any adder you always want one extra bit to detect overflow! This is why 9-bit adders are a common component in MCUs.
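The extra-bit idea can be illustrated with a small Python sketch. It models a 7-bit addition carried out in an 8-bit-wide adder, so the top bit acts as a carry-out flag. This only illustrates the commenter's reasoning; it is not a description of the Pentium's actual circuit, and the function name is made up.

```python
def add7_in_8bit(a, b):
    """Add two 7-bit values in an 8-bit adder; return (7-bit result, overflow flag)."""
    if not (0 <= a < 128 and 0 <= b < 128):
        raise ValueError("operands must fit in 7 bits")
    total = (a + b) & 0xFF          # what an 8-bit adder would output
    result = total & 0x7F           # low 7 bits are the usable sum
    overflow = bool(total & 0x80)   # the eighth bit records the carry out
    return result, overflow

print(add7_in_8bit(100, 27))  # (127, False)
print(add7_in_8bit(100, 28))  # (0, True)
```

Two 7-bit operands can sum to at most 254, so the eighth bit is always enough to capture the carry.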
I'm surprised they took the risk of extending the lookup table to have all 2's in the undefined region. A safer route would have been to just fix the 5 entries. Someone was pretty confident!
At the 2012 Turing Award conference in San Francisco, Prof. William Kahan mentioned that he had a newer test suite available in 1993 that would have caught Intel's bug. Still, Intel did not run it. Prof. Kahan was actively involved in the bug's analysis and further testing. (I'm stating this just from memory.)
> The explanation is that Intel didn't just fill in the five missing table entries with the correct value of 2. Instead, Intel filled all the unused table entries with 2.<p>I wonder why they didn't do this in the first place.
From someone who had to mentally let go once you started talking about planes crossing each other, thank you for such an amazingly detailed writeup. It's not every day that you learn a cool new way to divide numbers!
> He called Intel tech support but was brushed off<p>I laughed when I read this. It’s hard enough to get support for basic issues, good luck explaining a hardware bug.
Reminds me of part 2 of Day 24. Some wrong wirings. ;-)<p><a href="https://adventofcode.com/2024/day/24" rel="nofollow">https://adventofcode.com/2024/day/24</a>