I don't work in an environment where I get to deal with hardware failures, so pardon my ignorance, but has anyone actually seen a CPU that failed during normal operation? I'm under the impression that it is very rare for a CPU itself to fail badly enough to need replacement.<p>The only times I've even heard about failing CPUs, they had been overclocked or insufficiently cooled (add in overvolting and you get both :)), or physically damaged during mounting/unmounting or otherwise handling the hardware. And even then the failure has usually been somewhere other than the CPU itself.<p>Of course I'm not saying it's unheard of, but frankly, for me right now it is.
Not even mentioned here is metastability - when signals cross clock domains in traditional clocked logic, and the clocks aren't carefully arranged to be multiples of each other, a signal can end up being sampled just as it changes - the result is a value inside a flip-flop that's neither a 1 nor a 0 - sometimes an analog value somewhere in between, sometimes an oscillating mess at some unknown frequency - in the worst case this bad value propagates into the chip and causes havoc, a buzzing mess of chaos.<p>In the real world this doesn't happen very often, and there are techniques to mitigate it when it does (usually at a performance or latency cost) - core CPUs are probably safe, since they run on one clock, but display controllers, networking, anything that touches the real world has to synchronize with it.<p>For example, I was involved in designing a PC graphics chip in the mid '90s - we did the calculations around metastability (we had 3 clock domains and 2 crossings) and worked out that our chip would suffer a metastability event (anything from a burble on one frame of the screen to a complete breakdown) about once every 70 years - we decided we could live with that on Win95 systems - no one would ever notice.<p>Everyone who designs real-world systems should be doing that math - more than one clock domain is a no-no in life-support-rated systems - your pacemaker, for example.
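For the curious, the back-of-the-envelope math for a single synchronizer usually looks something like the sketch below (the parameter values are purely illustrative assumptions, not the figures from that '90s chip):

    import math

    # Rough MTBF estimate for one clock-domain crossing through a synchronizer flip-flop.
    # All numbers here are illustrative assumptions, not measurements from a real part.
    f_clk  = 50e6     # sampling clock frequency (Hz)
    f_data = 10e6     # rate of asynchronous input transitions (Hz)
    t_w    = 100e-12  # metastability window of the flip-flop (s)
    tau    = 200e-12  # metastability resolution time constant (s)
    t_r    = 6.5e-9   # settling time available before the sampled value is used (s)

    # Classic synchronizer MTBF formula: MTBF = e^(t_r / tau) / (t_w * f_clk * f_data)
    mtbf_seconds = math.exp(t_r / tau) / (t_w * f_clk * f_data)
    print(f"MTBF per crossing: ~{mtbf_seconds / (3600 * 24 * 365):.0f} years")

Sum the event rates (1/MTBF) over every crossing to get the figure for the whole chip; adding extra synchronizer stages buys more settling time at the cost of latency.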
It would be awesome if companies like Google would calculate MTBF statistics on components. They've done it for disks and it would be great to extend it to CPUs and memory modules. They're probably in a better position than even Intel to calculate these things with precision.
There is an anecdote Joe Armstrong likes to tell about people who claim they've built a reliable or fault-tolerant service. They'll say "This is fault tolerant, there are multiple hard drives in there, I have done formal verification of my code and so on..." and then someone trips over the power cord and that's the end of the fault tolerance. It's a silly example, of course they'd provide proper power to an important rack of hardware, but the point is that in the simplest case the system is only as fault tolerant as its weakest component. It's that one bad capacitor from Taiwan that might bring the whole thing down, or just a silly cosmic ray.<p>One needs redundant hardware to provide certain guarantees about the service being up. This means load balancers, multiple CPUs running the same code in parallel and comparing results, separate power buses, different data centers, different parts of the world.
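For illustration, the "multiple CPUs comparing results" part boils down to majority voting; here's a toy sketch (the names are made up, and a real lockstep system does this in hardware rather than software):

    from collections import Counter

    def majority_vote(results):
        """Return the value most replicas agree on; fail loudly if there is no majority."""
        value, count = Counter(results).most_common(1)[0]
        if count > len(results) // 2:
            return value
        raise RuntimeError("replicas disagree -- no majority result")

    # Hypothetical run: the same computation on three independent units,
    # one of which returned a corrupted result.
    print(majority_vote([42, 42, 17]))  # -> 42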
This study by Microsoft Research is interesting:<p>"Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs"<p><a href="http://research.microsoft.com/apps/pubs/default.aspx?id=144888" rel="nofollow">http://research.microsoft.com/apps/pubs/default.aspx?id=1448...</a>
If MTBF is such a big issue, then how is it ever possible to build spacecraft that travel across the stars and still retain the ability to communicate? Hats off to the designers of Voyager and the other spacecraft whose MTBF seems to have exceeded 36+ years for many components, including the CPU and power supply. For interstellar craft, the MTBFs quoted here seem VERY low. And, seriously, an MTBF of 5 years seems like a joke for a desktop when a lot of mechanical components with moving parts actually last longer.
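For a sense of what an MTBF figure actually implies, here's a rough sketch assuming the standard constant-failure-rate (exponential) model - MTBF is a statistical mean, not a guaranteed lifetime, and spacecraft add radiation-hardened parts, derating and redundancy on top of it:

    import math

    # Under a constant failure rate, P(a unit survives to time t) = exp(-t / MTBF).
    mtbf_years = 5.0  # illustrative desktop-class MTBF from the discussion above
    for t in (1, 5, 36):
        print(f"P(survives {t:>2} years) = {math.exp(-t / mtbf_years):.4f}")

That prints roughly 0.82, 0.37 and 0.0007 - so a single unit lasting 36 years on a 5-year MTBF is wildly unlikely without redundancy and far more conservative parts.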
(Conventional) solid-state devices are very hard to kill - the exception being flash memory.<p>Apart from electromigration issues and failures from excess voltage/temperature, they're pretty long-lasting.<p>It's much easier to have a failure caused by something else: capacitors failing, oxidation, or mechanical failure (for example, from thermal expansion/contraction).<p>I've seen people complaining about a dead CPU, but I can't find it right now.
As a side note, the whole site is an amazing collection of wisdom and worth bookmarking:<p><a href="http://yarchive.net/" rel="nofollow">http://yarchive.net/</a>
I'd like to throw in my experience:
I was in charge of 300+ x86 rack servers and around 50 desktops for 3 years and never saw a single CPU fail, not even old Pentium 4s with dusty fans.<p>Disk failures are very common, followed by much rarer RAM and motherboard failures.<p>I suspect server chips are rated for a 10-15 year average lifespan.
Soft errors are a very real property of low-voltage digital electronics. I personally observed what could only realistically be explained as a soft error in a unit of customer hardware running in the field. A single bit was flipped in the program memory of the embedded application and was causing the system to malfunction in an obvious and repeatable manner. We've since added CRC checking of the program memory and some of the static data sections to flag this and reset in the future.
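Conceptually, something along these lines - a minimal sketch using Python's zlib.crc32 for illustration (the real firmware obviously checks the MCU's flash/static data and triggers a watchdog reset instead):

    import zlib

    def crc_of(image: bytes) -> int:
        # CRC-32 over the protected region (program memory / static data).
        return zlib.crc32(image) & 0xFFFFFFFF

    protected_image = bytes(range(256)) * 16   # stand-in for the real program image
    expected_crc = crc_of(protected_image)     # recorded at build/boot time

    def periodic_integrity_check(image: bytes) -> None:
        # Run from a low-priority task or timer; reset on any mismatch.
        if crc_of(image) != expected_crc:
            raise SystemExit("program memory corrupted -- resetting")

    periodic_integrity_check(protected_image)  # passes while memory is intact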
There's a more-than-100-page thread on the Apple Support website about GPU failures after two years of use: <a href="https://discussions.apple.com/thread/4766577" rel="nofollow">https://discussions.apple.com/thread/4766577</a>
It doesn't seem worth it for Intel to measure MTBF. By the time they got good numbers for a specific chip, they'd be trying to sell its successor.
My immediate reaction is to ask how this reliability characteristic of CPUs affects critical software applications. Certainly some space missions and medical devices out in the field must have surpassed the MTBF mark for their given CPU deployments.
I've always wondered about this: do transistors wear out over time?<p>Does that mean a CPU/RAM/GPU will not perform as well as it did when it was brand new?