Intel's $475M error: the silicon behind the Pentium division bug

378 points by gslin 5 months ago

22 comments

kens 5 months ago

Author here if anyone has Pentium questions :-)

My Mastodon thread about the bug was on HN a few weeks ago, so this might seem familiar, but now I've finished a detailed blog post. The previous HN post has a bunch of comments: https://news.ycombinator.com/item?id=42391079


evanmoran 5 months ago

The bug is super fun, but I also find the Intel response to be fascinating on its own. They apparently didn't replace the processor with a non-faulty version for everyone who wanted one, resulting in a ton of bad press.

To contrast, I've been thinking a lot about the Amazon Colorsoft launch, which had a yellow band graphics issue on some devices (mine included). Amazon waited a bit before acknowledging it (maybe a day or two, presumably to get the facts right). Then they simply, quietly replaced all of them. No recall. They just send you a new one if you ask for it (my replacement comes Friday, hopefully it will fix it). My takeaway is that it's pretty clear that having an incredibly robust return/support apparatus has a lot of benefits when launches don't go quite right. Certainly more than you'd expect from analysis.

Similarly, I haven't seen too many recent reports about the Apple AirPods Pro crackle issue that happened a couple years ago (my AirPods had to be replaced twice), but Apple also just quietly replaced them, and the support competence really seemed something powerful that isn't always noticed.

Colorsoft: https://www.tomsguide.com/tablets/e-readers/amazon-kindle-colorsoft-yellow-stripe-defect-now-has-a-culprit

AirPods Pro: https://support.apple.com/airpods-pro-service-program-sound-issues


hinkley 5 months ago

> Intel's whitepaper claimed that a typical user would encounter a problem once every 27,000 years, insignificant compared to other sources of error such as DRAM bit flips.

> However, IBM performed their own analysis, suggesting that the problem could hit customers every few days.

I bet these aren't as far off as they seem. Intel seems to be considering a single user, while I suspect IBM is thinking in terms of support calls.

This is a problem I've had at work. When you process 100 million requests a day, the one-in-a-billion problem is hitting you a few times a month. If it's something a customer or, worse, a manager notices, they ignore the denominator and suspect you all of incompetence. Four times a month can translate into "all the time" in the manner humans bias their experiences. If you get two statistical clusters of three in a week, someone will lose their shit.
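The one-in-a-billion arithmetic above can be checked with a short sketch. The function names are made up for illustration, and the "at least one hit" figure assumes failures are independent and rare enough to model as a Poisson process, which is an assumption rather than anything stated in the thread.

```python
import math

def expected_hits(requests_per_day, failure_probability, days=30):
    """Expected number of failures over the period (simple rate * volume)."""
    return requests_per_day * failure_probability * days

def chance_of_at_least_one(expected):
    """Probability of seeing one or more failures, assuming a Poisson model."""
    return 1.0 - math.exp(-expected)

# 100 million requests a day, one-in-a-billion failure rate, 30-day month.
monthly = expected_hits(100_000_000, 1e-9)
print(f"expected failures per month: {monthly:.2f}")
print(f"chance of at least one: {chance_of_at_least_one(monthly):.1%}")
```

With these inputs the expectation comes out to roughly three failures a month, so seeing at least one in any given month is close to certain even though each individual request is almost never affected.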


WalterBright 5 months ago

> It appears that only one person (Professor Nicely) noticed the bug in actual use.

I recall a study done years ago where students were supplied calculators for their math class. The calculators had been doctored to produce incorrect results. The researchers wanted to know how wrong the calculators had to be before the students noticed something was amiss.

It was a factor of 2.

Noticing the error, and being affected by the error, are two entirely different things.

I.e. how many people check to see if the computer's output is correct? I'd say very, very, very few. Not me, either, except in one case: when I was doing engineering computations at Boeing, I'd run the equations backwards to verify the outputs matched the inputs.
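Running the equations backwards can be sketched as a round-trip check. Everything here is illustrative: `verify_round_trip` is a hypothetical helper, the circle-area formula is only a stand-in for a real engineering computation, and the tolerance is an arbitrary choice.

```python
import math

def verify_round_trip(forward, backward, value, rel_tol=1e-9):
    """Compute forward(value), invert it, and compare against the input.

    Returns (ok, result): ok is True when the recovered input matches
    the original within rel_tol.
    """
    result = forward(value)
    recovered = backward(result)
    return math.isclose(recovered, value, rel_tol=rel_tol), result

# Stand-in formulas: area of a circle and its inverse.
def area_from_radius(r):
    return math.pi * r * r

def radius_from_area(a):
    return math.sqrt(a / math.pi)

ok, area = verify_round_trip(area_from_radius, radius_from_area, 2.5)
print(ok, area)

# A doctored forward step (off by a factor of 2) fails the same check.
bad_ok, _ = verify_round_trip(lambda r: 2 * area_from_radius(r), radius_from_area, 2.5)
print(bad_ok)
```

The check only catches errors that the inverse computation does not share, so it works best when the backward path is written independently of the forward one.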


WalterBright 5 months ago

I remember that bug. Because I could not control what CPU my customers were running on, I had to add special code in the library to detect the bad FPU and execute workaround code (this code was supplied by Intel).

I.e. Intel's problem became my problem, grrrr

stickfigure 5 months ago

Reminds me of a joke floating around at the time that captures a couple different 90s themes:

<pre><code>I AM PENTIUM OF BORG. DIVISION IS FUTILE. YOU WILL BE APPROXIMATED.</code></pre>


dboreham 5 months ago

Another great article from Ken. I remember this particularly because the first PC that I bought with my own money had an affected CPU. Prior to this era I hadn't been much interested in PCs because they couldn't run "real" software. But Windows NT changed that (thank you Mr. Cutler), and Taiwanese sourced low cost motherboards made it practical to build your own machine, as many people still do today. Ken touched on the fact that it was easy for users to check if their CPU was affected. I remember that this was as easy as typing a division expression with the magic numbers into Excel. If MS had released a version of Excel that worked around the bug, I suspect fewer users would have claimed their replacement device!
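The division check described above can be sketched in a few lines. The operand pair 4195835 / 3145727 is the widely circulated test case and is not quoted in this thread; a flawed Pentium was reported to return roughly 1.33374 instead of 1.33382, leaving a residual of about 256. The function names and the tolerance are assumptions for illustration, and this is not the detection code Intel supplied.

```python
def fdiv_check(x=4195835.0, y=3145727.0):
    """Divide the well-known test operands and measure the residual.

    On a correct FPU, x - (x / y) * y is essentially zero.
    """
    quotient = x / y
    residual = x - quotient * y
    return quotient, residual

def looks_flawed(residual, tolerance=1.0):
    """Flag a result whose residual is far larger than rounding error."""
    return abs(residual) > tolerance

quotient, residual = fdiv_check()
print(f"quotient = {quotient:.15f}")
print(f"residual = {residual}")
print("FPU looks flawed" if looks_flawed(residual) else "FPU looks fine")
```

The tolerance is deliberately loose: ordinary rounding error in double precision is many orders of magnitude below 1, while the reported faulty residual is in the hundreds.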


urbandw311er 5 months ago

What an interesting and utterly dedicated analysis. Thank you so much for all your work analysing the silicon and sharing your findings. I particularly like how you're able to call out Intel on the actual root cause, which their PR made sound like something analogous to a trivial omission but which, in fact, was less forgivable and more blameworthy, i.e. they stuffed up their table generation algorithm.

ThrowawayTestr 5 months ago

> Smith posted the email on a Compuserve forum, a 1990s version of social media.

I hate how this sentence makes me feel.


Sniffnoy 5 months ago

Given that the fixed table is a much simpler one (by letting out-of-bounds just return 2, rather than adding circuitry to make it return 0), I wonder why they didn't just do it that way in the first place?
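For readers who want to see why the out-of-bounds default should not matter when the table is otherwise complete, here is a simplified sketch of radix-4 SRT division with quotient digits in {-2, ..., 2}. It is not the Pentium's circuit: per the article the real chip picks each digit from a lookup table, while this sketch uses exact rational arithmetic, and the names `select_digit`, `srt_divide`, and `out_of_range` are made up for illustration.

```python
from fractions import Fraction

BOUND = Fraction(8, 3)  # invariant: |partial remainder / divisor| <= 8/3

def select_digit(ratio, out_of_range=2):
    """Pick a quotient digit in {-2..2} for ratio = remainder / divisor."""
    if abs(ratio) > BOUND:
        # Never reached while the invariant holds; stands in for the
        # "unused" region of the table, so its value is a free choice.
        return out_of_range if ratio > 0 else -out_of_range
    # Nearest integer, clamped to the allowed digit set.
    return max(-2, min(2, round(ratio)))

def srt_divide(x, d, steps=30, out_of_range=2):
    """Approximate x / d one base-4 digit at a time (needs |x / d| <= 8/3)."""
    x, d = Fraction(x), Fraction(d)
    remainder = x
    quotient = Fraction(0)
    weight = Fraction(1)
    for _ in range(steps):
        digit = select_digit(remainder / d, out_of_range)
        quotient += digit * weight
        remainder = 4 * (remainder - digit * d)  # shift left by one base-4 digit
        weight /= 4
    return quotient

approx = srt_divide(4195835, 3145727)
print(float(approx))
print(srt_divide(4195835, 3145727, out_of_range=0) == approx)
```

In this idealized version the out-of-range branch is never taken, so returning 0 or 2 there gives the same quotient. The actual bug, as the article describes it, was that a handful of entries assumed to be unreachable could in fact be reached.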


Jean-Papoulos 5 months ago

> Since only one in 9 billion values caused the problem, Intel's view was that the problem was trivial: "This doesn't even qualify as an errata."

This sounds utterly insane. You are making a CPU; if any calculations are wrong, it needs to be fixed?? I suppose this only came to light very late into testing and it was very impractical to bin every CPU, so they rolled the dice.

ijustlovemath 5 months ago

> Curiously, the adder is an 8-bit adder but only 7 bits are used; perhaps the 8-bit adder was a standard logic block at Intel.

I believe this is because for any adder you always want 1 bit extra to detect overflow! This is why 9-bit adders are a common component in MCUs.


chiph 5 months ago

I'm surprised they took the risk of extending the lookup table to have all 2's in the undefined region. A safer route would have been to just fix the 5 entries. Someone was pretty confident!


hyperman1 5 months ago

How did idiv work on the Pentium? Was it also optimized, or somehow connected to fdiv, or just the old slow algorithm?

keshavmr 5 months ago

At the 2012 Turing Award conference in San Francisco, Prof. William Kahan mentioned that he had a newer test suite available in 1993 that would have caught Intel's bug. Still, Intel did not run it. Prof. Kahan was actively involved in the bug's analysis and further testing. (I'm stating this just from memory.)

CaliforniaKarl 5 months ago

> The explanation is that Intel didn't just fill in the five missing table entries with the correct value of 2. Instead, Intel filled all the unused table entries with 2.

I wonder why they didn't do this in the first place.


Unearned5161 5 months ago

From someone who had to mentally let go once you started talking about planes crossing each other, thank you for such an amazingly detailed writeup. It's not every day that you learn a new cool way to divide numbers!

tgma 5 months ago

Intel $475B error: not building a decent GPU


fourseventy 5 months ago

Didn't Intel have floating point division issues more recently as well?


coin 5 months ago

> He called Intel tech support but was brushed off

I laughed when I read this. It's hard enough to get support for basic issues, good luck explaining a hardware bug.

pieterr 5 months ago

Reminds me of part 2 of day 24. Some wrong wirings. ;-)

https://adventofcode.com/2024/day/24

fortran77 5 months ago

"At Intel, Quality is job 0.9999999999999999762"
