I don't work in an environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone actually seen a CPU that failed during normal operation? I'm under the impression that it is very rare for a CPU itself to fail badly enough to need replacement.<p>The only times I've even heard about failing CPUs, they had been overclocked or insufficiently cooled (add in overvolting and you get both :)), or physically damaged during mounting/unmounting or otherwise handling the hardware. And even then the failure has usually been somewhere other than the CPU itself.<p>Of course I'm not saying it's unheard of, but frankly, for me right now it is.
Not even mentioned here is metastability - when signals cross clock domains in traditional clocked logic, and the clocks aren't carefully arranged to be multiples of each other, a signal can end up being sampled just as it changes - the result is a value inside a flip-flop that's neither a 1 nor a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency - in the worst case this bad value propagates into the chip and causes havoc, a buzzing mess of chaos.<p>In the real world this doesn't happen very often, and there are techniques to mitigate it when it does (usually at a performance or latency cost) - core CPUs are probably safe, since they run on one clock, but display controllers, networking, anything that touches the real world has to synchronize with it.<p>For example, I was involved in designing a PC graphics chip in the mid '90s - we did the calculations around metastability (we had 3 clock domains and 2 crossings) and worked out that our chip would suffer a metastability event (anything from a burble on one frame of the screen to a complete breakdown) about once every 70 years - we decided we could live with that on Win95 systems - no one would ever notice.<p>Everyone who designs real-world systems should be doing that math - more than one clock domain is a no-no in life-support-rated systems - your pacemaker, for example.
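For the curious, the back-of-the-envelope math for a single synchronizer usually looks something like the sketch below (the parameter values are purely illustrative assumptions, not the figures from that '90s chip):

    import math

    # Rough MTBF estimate for one clock-domain crossing through a synchronizer flip-flop.
    # All numbers here are illustrative assumptions, not measurements from a real part.
    f_clk  = 50e6     # sampling clock frequency (Hz)
    f_data = 10e6     # rate of asynchronous input transitions (Hz)
    t_w    = 100e-12  # metastability window of the flip-flop (s)
    tau    = 200e-12  # metastability resolution time constant (s)
    t_r    = 6.5e-9   # settling time available before the sampled value is used (s)

    # Classic synchronizer MTBF formula: MTBF = e^(t_r / tau) / (t_w * f_clk * f_data)
    mtbf_seconds = math.exp(t_r / tau) / (t_w * f_clk * f_data)
    print(f"MTBF per crossing: ~{mtbf_seconds / (3600 * 24 * 365):.0f} years")

Sum the event rates (1/MTBF) over every crossing to get the figure for the whole chip; adding extra synchronizer stages buys more settling time at the cost of latency.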
It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.
There is an anecdote Joe Armstrong likes to tell about people who claim they've built a reliable or fault-tolerant service. They'll say "This is fault tolerant, there are multiple hard drives in there, I have done formal verification of my code and so on..." and then someone trips over the power cord and that's the end of the fault tolerance. It's a silly example, of course they'd provide proper power to an important rack of hardware, but the point is that in the simplest case the system is only as fault tolerant as its weakest component. It's that one bad capacitor from Taiwan that might bring the whole thing down, or just a silly cosmic ray.<p>One needs redundant hardware to provide certain guarantees about the service being up. This means load balancers, multiple CPUs running the same code in parallel and comparing results, separate power buses, different data centers, different parts of the world.
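For illustration, the "multiple CPUs comparing results" part boils down to majority voting; here's a toy sketch (the names are made up, and a real lockstep system does this in hardware rather than software):

    from collections import Counter

    def majority_vote(results):
        """Return the value most replicas agree on; fail loudly if there is no majority."""
        value, count = Counter(results).most_common(1)[0]
        if count > len(results) // 2:
            return value
        raise RuntimeError("replicas disagree -- no majority result")

    # Hypothetical run: the same computation on three independent units,
    # one of which returned a corrupted result.
    print(majority_vote([42, 42, 17]))  # -> 42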
This study by Microsoft Research is interesting:<p>"Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs"<p><a href="http://research.microsoft.com/apps/pubs/default.aspx?id=144888" rel="nofollow">http://research.microsoft.com/apps/pubs/default.aspx?id=1448...</a>
If MTBF is such a big issue, then how is it ever possible to build spacecraft that travel across the stars and still retain the ability to communicate? Hats off to the designers of Voyager and the other spacecraft whose MTBF seems to have exceeded 36+ years for many components, including the CPU and power supply. For interstellar craft, the MTBFs quoted here seem VERY low. And, seriously, an MTBF of 5 years seems like a joke for a desktop when a lot of mechanical components with moving parts actually last longer.
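For a sense of what an MTBF figure actually implies, here's a rough sketch assuming the standard constant-failure-rate (exponential) model - MTBF is a statistical mean, not a guaranteed lifetime, and spacecraft add radiation-hardened parts, derating and redundancy on top of it:

    import math

    # Under a constant failure rate, P(a unit survives to time t) = exp(-t / MTBF).
    mtbf_years = 5.0  # illustrative desktop-class MTBF from the discussion above
    for t in (1, 5, 36):
        print(f"P(survives {t:>2} years) = {math.exp(-t / mtbf_years):.4f}")

That prints roughly 0.82, 0.37 and 0.0007 - so a single unit lasting 36 years on a 5-year MTBF is wildly unlikely without redundancy and far more conservative parts.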
(Conventional) solid-state devices are very hard to kill - the exception being flash memory.<p>Apart from electromigration issues and failures from excess voltage/temperature, they're pretty long-lasting.<p>It's much easier to have a failure caused by something else: capacitors failing, oxidation, or mechanical failure (for example, from thermal expansion/contraction).<p>I've seen people complaining about a dead CPU, but I can't find it right now.
As a side note, the whole site is an amazing collection of wisdom and worth bookmarking:<p><a href="http://yarchive.net/" rel="nofollow">http://yarchive.net/</a>
I'd like to throw in my experience:
I was in charge of 300+ x86 rack servers and around 50 desktops for 3 years and never saw a single CPU fail, not even old Pentium 4s with dusty fans.<p>Disk failures are very common, followed by much rarer RAM and motherboard failures.<p>I suspect server chips are rated for a 10-15 year average lifespan.
Soft errors are a very real property of low-voltage digital electronics. I personally observed what could only realistically be explained as a soft error in a unit of customer hardware running in the field. A single bit was flipped in the program memory of the embedded application and was causing the system to malfunction in an obvious and repeatable manner. We've since added CRC checking of the program memory and some of the static data sections to flag this and reset in the future.
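Conceptually, something along these lines - a minimal sketch using Python's zlib.crc32 for illustration (the real firmware obviously checks the MCU's flash/static data and triggers a watchdog reset instead):

    import zlib

    def crc_of(image: bytes) -> int:
        # CRC-32 over the protected region (program memory / static data).
        return zlib.crc32(image) & 0xFFFFFFFF

    protected_image = bytes(range(256)) * 16   # stand-in for the real program image
    expected_crc = crc_of(protected_image)     # recorded at build/boot time

    def periodic_integrity_check(image: bytes) -> None:
        # Run from a low-priority task or timer; reset on any mismatch.
        if crc_of(image) != expected_crc:
            raise SystemExit("program memory corrupted -- resetting")

    periodic_integrity_check(protected_image)  # passes while memory is intact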
There's a more-than-100-page thread on the Apple Support website about GPU failures after two years of use: <a href="https://discussions.apple.com/thread/4766577" rel="nofollow">https://discussions.apple.com/thread/4766577</a>
It doesn't seem worth it for Intel to measure MTBF. By the time they got good numbers for a specific chip, they'd be trying to sell its successor.
My immediate reaction is to ask how this reliability characteristic of CPUs affects critical software applications. Certainly some space missions and medical devices out in the field must have surpassed the MTBF mark for their given CPU deployments.
I've always wondered about this: do transistors wear out over time?<p>Does that mean a CPU/RAM/GPU will not perform as well as it did when it was brand new?