We thought a lot about correlated storage failures - especially with regard to SSDs - as we rebuilt our infrastructure circa 2012/2013.<p>In the end, the low-hanging fruit - or, the biggest actionable takeaway - was that <i>when we build boot mirrors out of SSDs, they should not be identical SSDs</i>.<p>This was a hunch I had, personally, and I think experience and, now, results like these bear it out.<p>Consider: an SSD can fail <i>in a logical way</i> - not because of physical stress or mechanical wear, which have all kinds of random noise in them, but due to a particular sequence of usage. If the two SSDs are mirrored, it is possible that they receive <i>identical</i> usage sequences over their lifetimes.<p>... which means they can fail identically - perhaps simultaneously.<p>Nothing fancy or interesting about the solution: all rsync.net storage arrays have boot mirrors that mix either a current-generation Intel SSD with a previous-generation Intel SSD, <i>or</i> an Intel SSD with a Samsung SSD.
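If you wanted to enforce that policy mechanically, a pre-deployment check along these lines would do it. This is just a sketch, not our actual tooling - it assumes smartmontools is installed, and the /dev/ada0 and /dev/ada1 device paths are made up for the example:

    import subprocess, sys

    MIRROR_MEMBERS = ["/dev/ada0", "/dev/ada1"]  # hypothetical device paths

    def model_of(dev):
        # "smartctl -i" prints an information section; ATA drives
        # report a "Device Model:" line, NVMe drives "Model Number:"
        out = subprocess.run(["smartctl", "-i", dev],
                             capture_output=True, text=True).stdout
        for line in out.splitlines():
            if line.startswith(("Device Model:", "Model Number:")):
                return line.split(":", 1)[1].strip()
        return None

    models = [model_of(d) for d in MIRROR_MEMBERS]
    print(dict(zip(MIRROR_MEMBERS, models)))
    if len(set(models)) < len(models):
        sys.exit("refusing: mirror members share a model - mix them")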
Also: different NAS hosts, RAID cards, etc. Those have correlated failure modes too.<p>My personal backup strategy of buying a different backup drive every time seems wiser the more I learn.<p>At work we have two different NAS setups, each full of a different brand of near-identical drives. What we have been doing, though, is buying a few new drives every quarter and rotating them into the NAS boxes. So they're all WD 6TB Black or whatever, but of 12 drives we now have 4 original ones, then a pair 3 months newer than that, a pair 6 months newer, and so on. The "old" drives go into random machines around the office, because we employ engineers and they all seem to like having their own little 2-4 drive NAS boxes for "important stuff" (which is in many ways fine; we just have to regularly coach them on making sure the stuff they're actually working on is on our NAS, where it gets backed up - we host a GitLab instance, for example, so their code and project docs live in that).
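For what it's worth, the steady state of that rotation is easy to check: with 12 bays and the oldest pair swapped out each quarter, no drive stays in longer than six quarters, and no same-age cohort is ever bigger than a pair. A throwaway simulation (the 12-bay/2-per-quarter numbers are just my setup; adjust to taste):

    # 12 bays, swap out the two oldest drives each quarter
    BAYS, PAIR, QUARTERS = 12, 2, 10

    ages = [0] * BAYS                   # ages in quarters; all bought at once
    for _ in range(QUARTERS):
        ages = [a + 1 for a in ages]    # everything gets a quarter older
        ages.sort(reverse=True)
        ages[:PAIR] = [0] * PAIR        # oldest pair retired, new pair installed

    print(sorted(ages, reverse=True))
    # settles at [5, 5, 4, 4, 3, 3, 2, 2, 1, 1, 0, 0]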