TechEcho

Why I'm usually unnerved when modern SSDs die on us

314 points by stargrave over 6 years ago

34 comments

pkaye over 6 years ago
I worked on SSD firmware for quite a long time, and here is my perspective.

Early flash used to be fairly reliable with minimal error correction. However, with increasing density, smaller processes, and multi-level cells, it has become progressively less reliable and slower. Here are some of the things that we need to worry about: https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2016/20160808_PreConfH_Parnell.pdf

To compensate for all these deficiencies, the SSD architecture, and hence the entire FTL, becomes very complicated, because any part of it can become damaged at any time. We always have to have backup algorithms to recover from any scenario. It's difficult to build algorithms that can recover from arbitrary failures in a reasonable time. I cannot have a drive sitting around for 20 minutes trying to fsck itself.

Another problem is that the job, while rewarding, is not very lucrative. The chance of a multi-million-dollar payoff for an employee is low. I have a higher chance of becoming a millionaire working on a web-connected gadget. So it is really hard to recruit top-notch programmers who know how to figure out the algorithms, write the code, and debug the hardware. Most new grads these days are interested in Python, JavaScript, and machine learning.
niftich over 6 years ago
Not that spinning HDDs are really any different, but SSDs are a perfect example of an entire computer that you attach to yours and speak with through one of the (many) storage-oriented protocols. The device itself is a black box, and complex transformations take place between the physical persistence of the data and the logical structures that are exchanged on the wire. There are many layers of indirection, and many things that can go wrong: a fault in the underlying physical storage, a physical fault in the controller, or a logical (software) condition in the controller that puts it in an unrecoverable state.

Spinning-platter drives have parts that form a more relatable metaphor for humans' notions of wear and tear: skates of magnetic readers flying on a cushion of air above a rapidly rotating disc, with a gap of a few dozen nanometers, often smaller than the process size in the controller's silicon. They have arms that can move the head over a particular disc radius, and a motor that spins the entire stack of platters. These mechanical components exhibit wear proportional to their use -- this makes intuitive sense, and is also recorded in the SMART attributes, so drives of advanced age and many park cycles can be replaced preemptively before they catastrophically fail.

SSDs are missing many of the usual mechanisms that would contribute to physical wear leading to sudden catastrophic failure in advanced age. This means that, irrespective of their failure rate vs. HDDs, a higher proportion of their catastrophic failures are the fault of the controller. This is discouraging: essentially, the "storage layer" is now quite reliable, so the fallibility of the human-programmed controller is brought to light.
Waterluvian over 6 years ago
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"

Why shouldn't it? Isn't it just hardware too?

"With spinning HDs, drives might die abruptly but you could at least construct narratives about what could have happened to do that"

Why can't you do the same with SSDs?

It feels like the author's main complaint is the frustration of not understanding SSD hardware as well.

Is this a valid complaint? Are SSDs magical in some way? I'm not an expert, but... it's just hardware with pieces that do stuff. Why can't we come up with an understanding of why it fails?
nneonneo over 6 years ago
A major problem with SSDs seems to be "firmware death": the flash chips are physically fine (or mostly fine), but the firmware (or firmware memory) has gotten corrupted due to some programming error, electrical glitch, or cosmic ray. I've had scores of older SSDs die after things like power outages and sudden shutdown events. This is super frustrating, because the data is physically OK but the controller just isn't responding to any requests anymore.

I wonder if there's an easy way to distinguish a controller failure from a flash failure from the behavior of the device over the last few seconds/minutes of operation. In theory a controller failure should cause a fairly abrupt loss of service, but I'm sure there are soft lockup failure modes too.
lisper over 6 years ago
This is not a technological problem, it's a cultural one. These problems are easily fixed ("easily" by the standards of technical problems that regularly get fixed in other regimes). The reason they don't get fixed is that the customer reaction to failures like this is to rant at the mysterious storage gods that are making their lives miserable.

Needless to say, there are no mysterious storage gods. These are artifacts made by humans, and somewhere out there, there is an engineer who either understands why these failures are happening, or knows how to engineer these devices in such a way that when these failures happen, the cause can be determined, and then a design iteration can be done to reduce the failure rate and make the failure modes more robust. The reason this doesn't happen is that customers aren't demanding it. If major purchasers started demanding, essentially, an SLA from their SSD manufacturers, with actual financial consequences for violating it, you would be amazed how fast all of these problems would get fixed. But instead we vent our frustrations in blog posts and HN comments :-(
shittyadmin over 6 years ago
I've experienced a few seriously strange issues with modern SSDs, even some of the better ones.

I had a 512GB Samsung drive that would randomly become very slow at IO operations; the whole machine would stall for 10-30 seconds at a time, once or twice a day, while any process that tried to use the disk became blocked on IO. Then it'd come right back like everything was perfectly fine.

Issues like this definitely worry me; we're basically completely blind as to what those controllers and flash chips are actually doing. Not that it wasn't a similar situation with HDD controllers before, but at least it didn't seem as unpredictable.
docker_up over 6 years ago
I worked at a storage company, and they reinforced to us that not only does the OS lie to us, but the hard drives also lie to the OS. So you can't take anything you get from a hard drive as reliable; you have to test the data once you get it, e.g. through CRC. Data can get corrupted at any time.

As data densities get higher and higher, it doesn't take much to have a catastrophic data failure. The only way to protect against this is to have multiple replicas of your data.
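One cheap way to act on that advice is to carry an end-to-end checksum alongside the data, so corruption introduced anywhere below the application is caught on read-back. A minimal sketch using CRC32 (real systems often use stronger hashes, and filesystems like ZFS do this per block; the function names here are made up for illustration):

```python
import zlib

def store_with_crc(payload: bytes) -> bytes:
    """Append a CRC32 trailer before handing the data to the drive."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def load_with_crc(record: bytes) -> bytes:
    """Check the trailer on read-back; fail loudly if the drive lied."""
    payload, trailer = record[:-4], record[-4:]
    if zlib.crc32(payload) != int.from_bytes(trailer, "big"):
        raise IOError("checksum mismatch: data corrupted at rest or in flight")
    return payload

record = store_with_crc(b"important data")
assert load_with_crc(record) == b"important data"

# A single flipped bit on the way back is caught:
corrupted = bytes([record[0] ^ 0x01]) + record[1:]
try:
    load_with_crc(corrupted)
except IOError as e:
    print(e)  # checksum mismatch: data corrupted at rest or in flight
```

Note that a CRC only detects corruption; recovering from it still requires the replicas the comment describes.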
mshook over 6 years ago
In a way we need someone/something like Backblaze doing a SMART report about SSDs to let us know which SMART metrics we should be monitoring...

Because they've shown most metrics are kinda useless or mean different things from one manufacturer to the other.

https://www.backblaze.com/blog/what-smart-stats-indicate-hard-drive-failures/
https://www.backblaze.com/blog/hard-drive-smart-stats/
pmden over 6 years ago
"We had one SSD fail in this way and then come back when it was pulled out and reinserted, apparently perfectly healthy, which doesn't inspire confidence."

We've experienced exactly the same thing. Our general course of action is to perform a hard power cycle of the server through IPMI; a warm cycle doesn't seem to work. I've always presumed it was down to dodgy SSD controller firmware, given the way the drive suddenly stops appearing in the output of fdisk -l.
CosmicShadow over 6 years ago
I've had a few SSDs give me random issues, and they are so hard to pin down: sometimes they just work, other times they abruptly stop, or just aren't detected until like 3 reboots later, and then they work fine. After you've had trouble, they make you feel like you're hanging on a hope that the ground won't fall out from under you.

You also CAN'T HEAR if there is an issue, whereas with a hard drive sound acts as another warning sign that something might be going wrong or will go wrong soon. Loud ticking or clicking, or the sound of overwork, is a sure sign to start backing up and get ready to buy a new drive!
Severian over 6 years ago
Call me crazy, but I don't think that a Crucial MX300 is the best choice for an enterprise-worthy ZFS drive. I get what the author is concerned about, but I wouldn't be that surprised that a consumer-level SSD failed in what sounds like a heavily used fileserver.
lucb1e over 6 years ago
I don't know why my hard drives died either. And while a physical motor breaking is more tangible, a contact wearing out is also imaginable. I don't really care why SSDs or HDDs die; I care that they do, and therefore I have backups (well, ideally I would). I've had spinning rust fail on me while I was sitting at it, and that didn't help me save it; it might as well have been dead in zero seconds.
JohnFen over 6 years ago
It may be irrational, but I remain very distrustful of SSDs, in part for reasons like this. I use them occasionally as temporary storage, but I don't use them for anything that would cause me a headache if the drive died without warning. So far, my observation is that their lifespan is considerably shorter than spinning-platter drives, and spinning-platter drives typically give plenty of warning before actually dying.

Perhaps I'll grow more comfortable after another decade or so, when there is enough real-world experience to go by.
bluejay2387 over 6 years ago
Maybe I am being overly simplistic, but shouldn't it not matter?

Who in the modern age doesn't back up everything all the time? Don't we all operate with the assumption these things are going to blow at any time? 90%+ of my data is on cloud storage now anyway. When an SSD goes out, don't you just chuck it in the drawer of old drives that you promise to take to the disposal center this weekend (and never do) and then take a quick trip to your local computer store for a new one?

This reminds me of something an IT support staffer told me a long time ago: "The difference between an IT pro and a user is that to an IT pro, hard drives are a consumable resource."
rkagerer over 6 years ago
I've done a bit of ad-hoc reliability testing with SSDs.

Some years ago I got a great deal on several Pacer disks and wrote a program to write a pseudo-random sequence of data (using a known initial seed) across the entire disk, then read it back and compare. Part way through, the data didn't match. No ECC errors, nothing raised by the filesystem, just mismatched bits which came back in a manner that tried to "trick" me into thinking they were good data. This happened on like 5 of the 8 disks. Needless to say, I sent those crappy SSDs back to the manufacturer (unfortunately only got a 2/3 refund) along with some harsh words for their engineers.

I've had more name-brand SSDs fail, in various manners (even well-reviewed Kingston drives). Sometimes in ways where they can't be accessed at all; other times (at the best of times) in a manner which doesn't allow writes but still allows reads (albeit at a trickle of a data rate).

These days I use solely Intel-based, top-line SSDs, and some (very limited) Samsungs. The choice isn't based on empirical data, but rather an impression that their bar is a little higher (or more conservative) in terms of reliability, and simply not wanting to deal with the apparent issues I've seemed to encounter with other brands. The downtime lost from restoring / reconstructing just isn't worth it to me. Maybe I'm paying twice as much as I ought to, but since making the switch many years back it's worked out pretty well and I've been happy / fortunate.

I run my SSDs in RAID10 using high-end controllers (aside from a few in ZFS).

Just my own subjective experiences; again, I'm not doing this at scale.
loeg over 6 years ago
I recently had a similar SSD failure, although it wasn't in a "new fileserver" but my daily-use 2013 desktop. It was working, then it was producing write errors corrupting my filesystem, then the whole system died, very quickly. Fortunately for me, some data was recoverable from the corrupted disk; I had a local backup from 12h prior, and a tarsnap backup from about the same time.

(Um, here's where I have to be critical of tarsnap: their recovery performance is absolutely abysmal for small files. They're latency-bound between you, their EC2 instance, and the backing S3 store. Think single- or double-digit kB/s, and then think about how much data you back up with tarsnap. I can't recommend any other backup provider as better, but this is an experience where tarsnap left me very disappointed.)

Looking at that SSD and my other SSDs' SMART data, they report extra blocks remaining, and you can monitor that as it goes down. Ideally you replace the drive before it gets to zero.

My primary mistake was simply not monitoring that data in an effective way.

I don't think anyone who monitors HDDs has any real expectation that the high-level SMART yes/no is going to protect them from data loss. Instead they look at highly predictive factors like "Reallocated_Sector_Ct" or "Raw_Read_Error_Rate" (or even plain old "Power_On_Hours").

For SSDs it's quite similar: Reallocated_Sector_Ct, Power_On_Hours_and_Msec, Available_Reservd_Space, Uncorrectable_Error_Cnt, Erase_Fail_Count, Workld_Media_Wear_Indic, Media_Wearout_Indicator. Maybe NAND_Writes_1GiB.

NVMe SSDs provide SMART-like data on log page 2 ("Available Spare", "Percentage Used", "Power On Hours"). For some reason the NVMe spec does not require media to accept host-initiated self-checks, so most NVMe drives don't have the same functionality as smartctl --test. :-(
MarkusWandel over 6 years ago
For my home setup, at least, it's simple: put the OS on a dirt-cheap 120GB SSD, and all the user data on a multi-terabyte hard disk. You can always selectively migrate other performance-critical, but can-afford-to-lose, stuff onto the SSD later. If it breaks, I just buy another one and reinstall the OS. On laptops that can only take one drive, the SSD is it, but so is awareness that the data on them has to be considered ephemeral. I've had assorted hard disks die over the years, from old age, and so far without exception they've been "mostly" recoverable -- might have to give up on a few files that got hit by bad sectors, that sort of thing. And I have been warned about impending failure by SMART diagnostics.
mirimir over 6 years ago
My first experience with drive failure was a ~40MB HDD expansion card in a 386. The bearings got "sticky", so the spindle wouldn't start rotating. But there was a hole covered with aluminum tape, and you could insert the eraser end of a pencil and nudge it. So yes, very understandable.

Not too much later, I used Iomega ZIP drives and experienced the "click of death". That was sudden and irreversible, but also very understandable.

For the past couple of decades, I've consistently used RAID arrays, mostly RAID1 or RAID10 (and RAID0 or RAID5-6 for ephemeral stuff). I've had several HDD failures, but they were usually progressive, and I just swapped out and rebuilt.

I recently had my first SSD failure. And it was also progressive. The first symptom was a system freeze, requiring a hard reboot, and then I'd see that one of the SSDs had dropped out of the array. But I could add it back. At first, I thought that there was some software problem, and that the RAID behavior was just caused by the hard reboot.

But eventually the box wouldn't boot, so I had to replace the bad SSD and rebuild the array. It was complicated by having sd*1 RAID10 for /boot, and sd*5 RAID10 for LVM2 and LUKS. So I also had to run fdisk before device mapper would work.
linsomniac over 6 years ago
Reading that blog and its sister post about "flaky SMART data" on those same Crucial MX500 drives reminds me that not all SSDs are created equal.

Just like not all hard drives are created equal. My previous job involved a decade running 10 cabinets of servers an hour away with very little manpower; we eventually came to find that IBM/HGST drives were a lot more reliable than others.

We also evaluated some early SSDs, and they were terribly unreliable. We eventually settled on the Intel drives, and they were superb. At my new job we've been using mostly Intel and Samsung Pro drives; they work great. But Dell sent us a server with some "enterprise SSDs" in it that we eventually found were Plextor drives. Those things were terrible. We replaced them immediately with Intel, but used some of the Plextor drives elsewhere and had all of them fail within a year. I'd put the Intel 64GB SLC drives from our 7-year-old database server in a system before I'd put in one of those brand-new "enterprise" Plextor drives.

I love Crucial, I buy a lot of RAM from them, but I'm skeptical of switching to other brands of SSDs. The more experience I have, the more conservative I get with systems that matter.
tuzakey over 6 years ago
I had a bunch of Crucial SSDs die a few years back; they'd work for an hour, then disappear from the bus. Reboot, and they'd work again for an hour. It turned out Crucial had a small counter tracking uptime by the hour; it would increment the counter to an overflow and crash. This failure could just as easily have occurred on a spinning HDD.
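The comment doesn't give the exact details of that firmware bug, but the general failure mode, a fixed-width uptime counter silently wrapping, is easy to illustrate (the 16-bit field width here is a made-up example, not the actual Crucial design):

```python
def tick(counter: int, bits: int = 16) -> int:
    """Advance an hours-of-uptime counter stored in a fixed-width field."""
    return (counter + 1) % (1 << bits)

# A 16-bit counter wraps after 65,536 hours of power-on time (~7.5 years).
# Firmware that assumes the value only ever grows -- say, indexing a log by
# it, or treating old > new as corruption -- can crash at the moment of wrap.
print(tick((1 << 16) - 1))  # 0, not 65536
```

The point of the illustration: the fault is purely in the bookkeeping logic, not the storage medium, which is why the same bug could ship in a spinning drive's firmware just as easily.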
dogben over 6 years ago
Crucial is using low-grade NAND on some of its products: https://www.reddit.com/r/hardware/comments/a4uwag/spectek_flash_without_logo_grade_marking_low/
pinebox over 6 years ago
I actually much prefer this SSD failure mode: unlike failing spinning rust, which will happily linger around coughing up bad data (which will then be written to backups, mirrored drives, etc., potentially creating a huge mess), an SSD going out like a light is comfortingly binary.
dooglius over 6 years ago
Relevant: there is a project called LightNVM [0] which is pushing for a much lower-level API to SSDs that allows most of the complexity to be moved into the host OS (namely, Linux).

[0] http://lightnvm.io
__x0x__ over 6 years ago
To add to the anecdata: my most recent SSD failure happened *when I did the firmware upgrade*. It worked before the upgrade, the upgrade binary said "upgrade failed", and the disk vanished and never returned after the "upgrade".
XorNot over 6 years ago
This post, more than any other, just convinced me to pull out my old Unison file-sync configuration (which was really good, looking at it) and get regular syncs to my NAS (which in turn uploads to cloud storage) working properly again.
Shivetya over 6 years ago
Having recently swapped 100TB of spinning media to SSD, I am awaiting the first failures. This being a business environment, it is all mirrored capacity. So I guess my question from the article is: are they running on a single device? No RAID or mirror?

I am loath to even keep my personal data at home on one drive, and since I use an iMac, that requires me to have Time Machine, as mirroring etc. of the internal drive is not truly possible; at least, I did not spend enough time researching it.
massafaka over 6 years ago
I think those drives dying quickly is actually a Good Thing™, because the chances that you're backing up corrupt data might become smaller…

With the older drives you would sometimes have a drive die, replace it, and restore your backup, only to find that in the process of dying the drive was actually corrupting some of the data, which went into the backups; now you've got to hunt down the last uncorrupted versions of the data in the backup…
bitL over 6 years ago
Did you try to bake it in the oven ("reflow")? Sometimes you can add a few more hours to its life, enough for backing it up.
lordnacho over 6 years ago
This seems like the very human problem of trying to grapple with probability.

We have all sorts of knowledge about it, but when something happens we're still looking for an explanation for each instance.

If you think of it like nuclear decay, you'll still be able to say things about the ensemble, but not about each individual member.
coreyoconnor over 6 years ago
"When a HD died early, you could also imagine undetected manufacturing flaws that finally gave way. With SSDs, at least in theory that shouldn't happen"

This is incorrect. As much of the argument seems predicated on this, I don't see a real issue.
Rafuino over 6 years ago
Keep getting a 403 Forbidden error. Anyone have an archive link they can send my way?
n-gatedotcom over 6 years ago
Two questions: how do major cloud providers (Azure, AWS, Heroku) handle storage failures?

And what are some best practices for early warning of a personal hard drive crash?
HelloNurse over 6 years ago
TL;DR: The lack of noises makes SSDs bad at motivating users to do backups or use redundant storage: they don't seem to be on the verge of catastrophic failure.
bepvte over 6 years ago
Anyone have a good guide on buying reputable SSDs?