Very useful -- I will take this analysis into account when it's time to upgrade my current personal machine or configure the next one! Thank you for posting this here.<p>The only thing I would have wanted to see but didn't in this analysis is how failure rates vary for different types of disk subsystem -- specifically, traditional hard drives versus the newer solid-state devices. I suspect, but don't know for sure, that the latter have much, much lower real-world failure rates in the first 30 days of total accumulated CPU time (TACT).<p>The authors openly suggest that the sharp difference in failure rates between desktop and laptop machines may be due in part to their disk subsystems: "Laptops are between 25% and 60% less likely than desktop machines to crash from a hardware fault over the first 30 days of observed TACT. We hypothesize that the durability features built into laptops (such as motion-robust hard drives) make these machines more robust to failures in general." Alas, the authors don't delve any further into it.<p>I'd like to see hard data comparing the real-world failure rates of <i>both</i> desktops and laptops using traditional versus solid-state disk subsystems.
When Microsoft, Google, or some university publishes an analysis of hardware failures across large numbers of machines, they always anonymize hardware vendors ("vendor A", "vendor B").<p>I understand the reasons (not alienating your hardware vendors), but will there ever be a research group that will disclose vendor names? Heck, I would <i>pay</i> for this information.
Among other interesting insights:<p>* a machine that crashed once is 100 times more likely to crash again; the more it crashes, the more it's prone to fail again.<p>* overclocking significantly reduces reliability. One CPU vendor (AMD or Intel, but unspecified) is much worse in this regard, too.<p>* conversely, underclocking improves reliability.<p>* branded computers are more reliable than beige boxes.<p>* laptops are more reliable than desktops.
I'm having a hard time coming to terms with "Laptops less likely to crash from hardware fault than desktops".<p>Everything we've learned from experience, surveys, and PC World magazines has shown the opposite. Heat kills hardware, and laptops have their hardware packed together so closely that they build up a lot of heat. I remember reading back then that something like 1 in 4 laptops fail in the first 3 years, which was very believable. At the time I was in college for game design & development, and all 80 guys in our class had laptops from HP (with, get this... Pentium 4s in them). Those laptops had a LOT of problems. They were basically portable heaters.<p>So I guess laptops now have either much better cooling, much cooler CPUs, or a combination. OR desktop PCs are just terribly cooled.
Interesting stuff. You can improve reliability by running your system at a lower speed. Here's a blog post with a summary of some of the conclusions of the paper above: <a href="http://grano.la/blog/2012/06/improve-the-reliability-of-your-pc/" rel="nofollow">http://grano.la/blog/2012/06/improve-the-reliability-of-your...</a> (Disclaimer: that's my company's blog)<p>One question I still have is whether the switching of CPU frequencies has any effect, or if it is only the average speed that correlates with reliability. Anecdotal evidence suggests that this is the case, but it could be an area for further research.
Interesting. Too bad the power supplies could not be controlled for in their setup, as a wonky power supply can unleash all kinds of gremlins that look like failures in components down the line.
Interesting read, though why can't Microsoft just tell me that my CPU or HD or memory is borking and suggest I RMA it, instead of asking every time whether I have applied the latest updates, which I get to mark as unhelpful?<p>The most important thing in a PC for reliability, I have found, is a good PSU. It really does make a difference on the hardware side, as you give your kit cleaner power. Add a UPS/surge protector and you can double the lifetime of your kit. At least in my experience, the difference has been noticeable.
The copy at Microsoft Research is offline.<p>Here's another copy of the paper:<p><a href="http://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf" rel="nofollow">http://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nighti...</a>
There are a lot of insights in the paper, but I'd really like to know about this:<p>"The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely"<p>So overclocking raises the crash likelihood roughly 5 times more for one vendor than for the other (presumably Intel and AMD), but they don't say which one is better. Does anybody know?
The result most surprising to me is that laptops are between 25% and 60% less likely than desktop machines to crash from a hardware fault during the first 30 days' worth of measurements.<p>The much larger weight and volume of desktops would seem to make them easier to cool.
I've always said that smart hardware tinkerers <i>underclock</i>. It produces less heat, and results in a quieter machine. I always suspected it improves reliability.