I worked on this problem for the past year at Google. It's a fascinating problem. In my subarea I focused on accelerators (like GPUs) running machine learning training.<p>Many users report problems like "NaN" during training: at some point the gradients blow up and the job crashes. Sometimes these are caused by specific examples, or by numerical errors on the part of the model developer, but sometimes they are the result of errors from bad cores (during matrix multiplication, embedding lookup, vector ops, whatever).<p>ML is usually pretty tolerant of small amounts of added noise (especially if it has nice statistical properties), and some training jobs will ride through a ton of uncorrected and undetected errors with few problems. It's a very challenging field to work in because it's hard to know whether your NaN comes from your model or from your chip.
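A minimal sketch of the kind of triage that implies (hypothetical code; step_fn, batch, and devices are stand-ins, not any real tooling): when a step produces non-finite gradients, replay the exact same batch on a different device before blaming the model.

    import numpy as np

    def finite(grads):
        return all(np.isfinite(g).all() for g in grads)

    def train_step_with_replay(step_fn, batch, devices):
        grads = step_fn(batch, devices[0])
        if finite(grads):
            return grads
        # Replay the identical batch elsewhere before blaming the model or data.
        replay = step_fn(batch, devices[1])
        if finite(replay):
            print("non-finite grads only on", devices[0], "- suspect the hardware")
            return replay
        raise FloatingPointError("NaN/Inf grads on both devices - suspect model or data")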
Site is down, archive link: <a href="https://web.archive.org/web/20210602080638/https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf" rel="nofollow">https://web.archive.org/web/20210602080638/https://sigops.or...</a><p>What stands out to me:<p>- "Mercurial cores are extremely rare" but "we observe on the order of a few mercurial cores per several thousand machines". On average one faulty core per thousand machines? That's quite a high rate.<p>- Vendors surely must know about this? If not from testing, then from experiencing the failures in their own servers.<p>- I've read the whole paper and I see no mention of them even reaching out to vendors about this issue. There are strong incentives on both sides to solve or mitigate this issue, so why aren't they working together?
The article references Dixit et al. for an example of a root-cause investigation of a CEE (corrupt execution error), which is an interesting read: <a href="https://arxiv.org/pdf/2102.11245.pdf" rel="nofollow">https://arxiv.org/pdf/2102.11245.pdf</a><p>> After a few iterations, it became obvious that the computation of Int(1.1^53) = 0 as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set, Int(1.1^52) = 142, the result was accurate.
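For anyone curious what that kind of reproducer looks like in practice, here is a rough sketch (in Python rather than the original Scala, and assuming a Linux box where os.sched_setaffinity is available): pin to one logical core at a time and compare the same pow-and-truncate against a reference.

    import math
    import os

    def pow_to_int_on_core(core, base=1.1, exp=53):
        os.sched_setaffinity(0, {core})   # pin this process to one logical core
        return int(math.pow(base, exp))   # int(1.1**53) should be 156

    # Note: the reference here comes from core 0, which could in principle be the
    # bad one; a real test would compare against a precomputed known-good value.
    expected = pow_to_int_on_core(0)
    for core in range(os.cpu_count()):
        got = pow_to_int_on_core(core)
        if got != expected:
            print(f"core {core}: got {got}, expected {expected}")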
I'd love to see more details on the defective parts, particularly counts by CPU model (anonymized if need be) and counts of which part of the architecture exhibited faults.<p>From working in HPC I've handled reports of things like FMA units producing incorrect results or the random appearance of NaNs. Were it not for the fact that we knew these things could happen, and for customers' intimate knowledge of their codes, I dread to think how "normal" operations would track these issues down. Bad parts went back to the CPU manufacturer and further testing typically confirmed the fault. But that end of the process was pretty much a black box to anyone but the CPU manufacturer. I'd be keen to know more about this too.
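The "intimate knowledge of their codes" part matters because many HPC codes already carry cheap sanity checks that make this class of fault visible. A hypothetical sketch of the idea (not any actual acceptance test): verify that a solve's residual is finite and small; a spurious NaN or a silently wrong FMA result shows up here even though the library call "succeeds".

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal((1000, 1000)) + 1000 * np.eye(1000)  # well-conditioned
    b = rng.standard_normal(1000)

    x = np.linalg.solve(a, b)
    residual = np.linalg.norm(a @ x - b)

    if not np.isfinite(residual) or residual > 1e-6 * np.linalg.norm(b):
        # On healthy hardware this should essentially never trigger for this system;
        # failures that consistently follow one host or core are the interesting case.
        raise RuntimeError(f"suspicious solve: residual={residual}")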
Fault tolerance seems to be the fundamental issue looming in the background of both traditional and quantum computing at the moment. Silicon is already at the point where there are only a dozen or so dopant atoms per gate, so a fluctuation of one or two atoms can be enough to impact behavior. It's amazing to me that with billions of transistors things work as well as they do. At some point it might be good to try to re-approach computation from some kind of error-prone analogue of the Turing machine.
> A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.<p>Incredible
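A hedged sketch of how you might screen for exactly that failure mode from user space (hypothetical; assumes Linux core pinning and the third-party 'cryptography' package, and is obviously not the paper's methodology): do the encryption and the decryption on different cores, and also compare the ciphertext itself against a copy produced elsewhere.

    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key, nonce = os.urandom(32), os.urandom(16)
    plaintext = os.urandom(1 << 20)

    def aes_ctr(data, core):
        # CTR mode is its own inverse, so the same routine encrypts and decrypts.
        os.sched_setaffinity(0, {core})
        enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
        return enc.update(data) + enc.finalize()

    ct_a = aes_ctr(plaintext, core=0)   # encrypt on core 0
    ct_b = aes_ctr(plaintext, core=1)   # same computation on core 1
    pt_back = aes_ctr(ct_a, core=1)     # decrypt core 0's output on core 1
    assert ct_a == ct_b, "ciphertexts differ across cores"
    assert pt_back == plaintext, "cross-core round trip failed"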
I don't completely understand the perception that standard non-hardened high-perf CPUs, especially in an industry and more specifically in a segment that has been reported as consistently cutting a few corners in recent years (maybe <i>somehow</i> less than client CPUs, but still), should somehow be exempt from silent defects, because... magic?<p>If you want extremely high reliability, for critical applications, you use other CPUs. Of course, they are slower.<p>So the only interesting info that remains is that the defect rate seems way too high, and that quality may be decreasing in recent years. In which case, when you are Google, you probably could and should complain (strongly) to your CPU vendors, because likely their testing is lacking and their engineering margins too low... (at least if it's really the silicon that is at fault, and not, say, the motherboard).<p>Now of course it's a little late for the existing parts, but the sudden realization that "OMG, CPUs do sometimes fail, in a variety of modes and for a variety of reasons" (including, surprise(?!), aging) seems naïve unless the focus is on the defect rate. And the potential risk of sometimes having high error rates was already very well known, especially in the presence of software changes and/or heterogeneous software and/or heterogeneous hardware, due to the existence of logical CPU bugs, which sometimes also result in silent data corruption, and sometimes with non-deterministic-looking behavior (so a computation can work on one core but not another because of "random" memory controller pressure and delays, and the next time with the two cores reversed).
Modern embedded cores have self-testing code that detects anywhere from 50% to 90%[1] of faults in the hardware, including faults from ageing.<p>If Google and the other hyperscalers complain enough, there's no reason Intel couldn't give them some self-test to run every hour or so.<p>[1] Depends on how complex the CPU is, how long you accept to run the self-test code, and how well it was done.
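A purely software-level sketch of the "every hour or so" idea (hypothetical; real structural self-test routines run vendor-supplied patterns with far better fault coverage, and this assumes Linux core pinning):

    import hashlib
    import os
    import time

    # Known-good answer, computed once up front; a real test would ship a
    # precomputed constant rather than trust whichever core runs first.
    GOLDEN = hashlib.sha256(b"x" * 1_000_000).hexdigest()

    def self_test_once():
        bad = []
        for core in range(os.cpu_count()):
            os.sched_setaffinity(0, {core})
            if hashlib.sha256(b"x" * 1_000_000).hexdigest() != GOLDEN:
                bad.append(core)
        return bad

    while True:
        suspects = self_test_once()
        if suspects:
            print("cores failing self-test:", suspects)  # e.g. drain and quarantine
        time.sleep(3600)  # roughly hourly, as suggested above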
This is fascinating. I feel like the most straightforward (but hardly efficient) solution would be to provide a way for kernels to ask CPUs to "mirror" pairs of cores, and have the CPUs internally check that their behavior is identical. Seems like a good way to avoid large-scale data corruption until we develop better techniques...
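True lockstep needs hardware support, but a rough software approximation of the same idea looks like this (hypothetical sketch; assumes a deterministic, picklable function and Linux core pinning):

    import multiprocessing as mp
    import os

    def run_pinned(fn, args, core, out):
        os.sched_setaffinity(0, {core})   # confine this worker to one core
        out.put(fn(*args))

    def mirrored_call(fn, args, cores=(0, 1)):
        out = mp.Queue()
        procs = [mp.Process(target=run_pinned, args=(fn, args, c, out))
                 for c in cores]
        for p in procs:
            p.start()
        results = [out.get() for _ in cores]
        for p in procs:
            p.join()
        if results[0] != results[1]:
            raise RuntimeError("mirrored cores disagree - possible corrupt execution")
        return results[0]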
This might not be a constructive observation, but I can just see the IBM mainframe designers sitting back with a refreshing beverage while we talk about identifying and handling hardware faults at runtime.<p><a href="https://www.ibm.com/support/pages/ibm-power-systems™-reliability-availability-and-scalability-ras-features" rel="nofollow">https://www.ibm.com/support/pages/ibm-power-systems™-reliabi...</a>
Can't reproduce the issue after a few minutes? Sorry wont-fix, mercurial core.<p>Joking aside, it's really neat to see the scale and sophistication of error detection appearing in these data centers.
I wonder about the larger feedback loops between hardware error checking in software and the optimizations hardware manufacturers are making at the fab. Presumably more robust software would result in buggier cores being shipped, but would this actually result in more net computation per dollar spent on processors?