Originally TRIM was an un-queued command: all writes had to be flushed, then TRIM executed, then writes could continue. This was bad for performance with automatic on-file-delete trim, so everyone wanted a TRIM command that could be put in the command queue along with writes. Many newer drives support this.<p>It turns out that Samsung 8XX SSDs advertise support for queued TRIM, but it's buggy. The old un-queued TRIM command works fine.<p><a href="https://lkml.org/lkml/2015/6/10/642" rel="nofollow">https://lkml.org/lkml/2015/6/10/642</a><p>There are in fact lots of "quirks lists" and "blacklists" in the kernel, and virtually all computers require some workarounds in the Linux kernel for some buggy hardware they have. Pretty amazing when you think about it.<p>EDIT: another closely related example is MacBook Pro SSDs and NCQ, aka native command queuing. They claim they support it, but on many it's buggy. It gets better though; the Linux kernel only started trying to use such functionality by default relatively recently.<p><a href="https://bugzilla.kernel.org/show_bug.cgi?id=60731" rel="nofollow">https://bugzilla.kernel.org/show_bug.cgi?id=60731</a><p>These sorts of things are, as you can see, very confusing and frustrating to track down, identify, and find a general fix for.<p>EDIT2: now that I've actually read the kernel bugzilla entry further, it's more recently come to light that the actual problem with recent MacBook Pro SSDs is MSI (message-signaled interrupts, a more efficient interrupt mechanism)
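If you're curious whether your own machine is in the affected configuration, a rough sketch (the device name is a placeholder, and the exact dmesg wording varies by kernel version):<p><pre><code> # Does the kernel expose discard (TRIM) for this device at all?
# Non-zero DISC-GRAN / DISC-MAX values mean discards will be issued.
lsblk --discard /dev/sda

# Failed queued TRIM tends to show up in the kernel log as failed
# "SEND FPDMA QUEUED" commands (the NCQ opcode that carries DSM TRIM).
dmesg | grep -iE 'trim|fpdma|ncq'</code></pre>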
Nice debugging story. When I was at NetApp there were lots of times when drive firmware for the 'less used' options would fail. On the Fibre Channel drives, the 'write zeros' command, which was supposed to zero a drive, was notorious for its inability to achieve something that simple. When Google looked at disk encryption technology (I don't know if they ever deployed it), it worked differently from disk to disk and from firmware rev to firmware rev. I think it was Brian Pawlowski at NetApp who said "You can count on two things working right in a hard drive: read, write, and seek." The joke being that you needed all three of them to work for reliable disk operation.
Here's an Ubuntu bug tracker entry for what sounds like the same problem: <a href="https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/1449005" rel="nofollow">https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/144900...</a><p>Linux 4.0.5 includes a patch that blacklists queued TRIM for the buggy drives. Windows and OS X apparently don't support queued TRIM at all, so they're unaffected.
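The workaround lives in libata's quirks table; if you have a kernel tree handy you can check whether it covers your drive (the flag name and path are from the upstream tree, so treat this as a sketch):<p><pre><code> # Blacklisted models get the NO_NCQ_TRIM flag, which keeps plain TRIM
# working but stops the kernel from sending the queued variant.
grep -n 'NO_NCQ_TRIM' drivers/ata/libata-core.c</code></pre>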
To me, this sort of thing brings home the value of not running your own machines. Sure, Amazon's/Google's clouds have quirks, but it's far less likely that you're going to have to debug faulty hardware in this way. It sounds like a team of more than one person worked on this at least part-time for weeks -- how much is that worth? It's not just the cost of hiring extra people to do the work; often small companies simply can't hire enough good people -- when you do find them, do you want to squander them twiddling servers?
Not directly related to TRIM, but Aerospike has a nice test suite for SSDs, probing for IOPS and latency: <a href="https://github.com/aerospike/act" rel="nofollow">https://github.com/aerospike/act</a><p>They share their test results for both physical and cloud-based storage; I figured this would be of interest:<p><a href="http://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html" rel="nofollow">http://www.aerospike.com/docs/operations/plan/ssd/ssd_certif...</a>
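From memory of the README, a run looks roughly like this (tool names, paths, and the config file are approximate and may differ between versions; the device is a placeholder):<p><pre><code> # Salt the device first so the SSD is in a realistic steady state,
# then run the workload and post-process the latency histogram.
sudo ./act_prep /dev/sdb
sudo ./act actconfig.txt > act_out.txt
./latency_calc/act_latency.py -l act_out.txt</code></pre>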
It feels like Samsung used the Linux community here as a free testbed.<p>Samsung knew that only Linux supported queued trim, so releasing it without proper testing is just externalizing the disproportionately increased cost of testing to the Linux community.
Strange, the Samsung 840/850 EVO/Pro are considered [1][2] among the best consumer SSDs. The issues the article mentions do not exist on Windows; the SSDs are very reliable there. I suspect it's not only Samsung's fault. Are we sure Linux's handling of TRIM operations is absolutely correct?<p>[1] <a href="http://techreport.com/review/27062/the-ssd-endurance-experiment-only-two-remain-after-1-5pb" rel="nofollow">http://techreport.com/review/27062/the-ssd-endurance-experim...</a><p>[2] <a href="http://www.anandtech.com/show/8216/samsung-ssd-850-pro-128gb-256gb-1tb-review-enter-the-3d-era/13" rel="nofollow">http://www.anandtech.com/show/8216/samsung-ssd-850-pro-128gb...</a>
I have this running on my Ubuntu ThinkPad with a Samsung 840 Pro as a weekly cron job. Should I turn it off?<p><pre><code> #!/bin/sh
# call fstrim-all to trim all mounted file systems which support it
set -e
# This only runs on Intel and Samsung SSDs by default, as some SSDs with faulty
# firmware may encounter data loss problems when running fstrim under high I/O
# load (e. g. https://launchpad.net/bugs/1259829). You can append the
# --no-model-check option here to disable the vendor check and run fstrim on
# all SSD drives.
exec fstrim-all</code></pre>
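Worth noting: on a new enough kernel (4.0.5+, per the blacklist patch mentioned elsewhere in this thread) the affected Samsung models are excluded from queued TRIM, so a job like this falls back to the old un-queued command and should be safe. You can run it by hand to see what it actually does:<p><pre><code> # Verbose one-off run; prints how many bytes were trimmed per filesystem.
sudo fstrim -v /</code></pre>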
Pretty disappointing to see some of those Samsung drives on the list, because in some of the other tests/surveys I've seen they seemed to be among the better choices. <i>Sigh</i> I guess Sturgeon's Law applies to SSDs too.
"Samsung SSD 850 PRO 512GB
recently blacklisted as 850 Pro and later in 8-series blacklist"<p>That's what I have in my home computer, with Arch Linux.<p>Do you think this problem is particular to the servers of the article's author, or should it be interpreted as:<p>Linux + Samsung 850 = you will lose your data?<p>Thanks...
Using SAS SSDs in a server is a bad idea for many reasons. One should use PCIe cards that sit directly on the PCIe bus, such as FusionIO or SanDisk. They have been tested and retested (e.g. by Facebook), without the unnecessary added complexity of the SAS/SATA protocols. The I/O performance is also about 20x.
Been there, done that. :|<p>Sometime around the end of 2013 I started getting frequent data loss and corrupted filesystems upon reboot.
After much searching and about 4-6 months into the issue, I found out that the culprit was the queued TRIM commands issued by the Linux kernel to my Crucial M500 mSATA disk. The Linux kernel already had a quirks list with many drives, including some of the M500 variants, just not mine.<p>I added my model, compiled the kernel, and the nightmare ended. I proceeded to submit a bug report and a patch. The patch got accepted (yay!) and the bug report turned out to be very useful for other people with the same problem but a different disk, as I included the dmesg output specific to the issue. This meant they could now google the errors and get a helpful result.<p>Such is the nature of free software; you are allowed to fix your computer yourself. :)
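For anyone wanting to do the same, step one is getting the exact model string the kernel sees, since the blacklist entries in drivers/ata/libata-core.c are matched against it (device name is a placeholder; entries may use glob patterns and can also match on firmware revision):<p><pre><code> # Model string as the kernel sees it
cat /sys/block/sda/device/model

# Or, with smartmontools, which also shows the firmware revision
smartctl -i /dev/sda</code></pre>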
I've worked on some interesting SSD deployments and experiments over the past 12 months. Quite honestly, I wouldn't go anywhere near Samsung products, regardless of their 'PRO' labelling or otherwise.<p>We have had great success with both SanDisk Extreme Pro SATA and Intel DC NVMe series drives. We've also recently deployed a number of Crucial 'Micron' M600 1TB SATA drives that are performing very well and so far haven't given us any issues.
I've had issues with these Samsung 8xx drives; unfortunately they all happened at once. I gave up on their RMA/warranty process because I was bounced back and forth between the same two numbers a few times. Each side said that the other was in charge of the process (Samsung bought the SSD division from Seagate... or was it Seagate that bought the HDD division from Samsung? To this day I have no clue.).
I have a Samsung SSD 850 PRO 512GB in my Windows PC, and I have TRIM enabled in Windows:<p><pre><code> > fsutil.exe behavior query DisableDeleteNotify
DisableDeleteNotify = 0
</code></pre>
Should I be worried?
Can someone clarify the article's claim that these Samsung drives are really "broken" as such? We have a few of these on 3.13 and 3.16 kernels and ext4 with no problems. It seems that there must be something unique to their application in order to expose these trim failures.
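One plausible explanation (my guess, not from the article): the failures seem to need queued TRIM issued concurrently with other I/O, which mostly happens with the discard mount option under load; a periodic fstrim on a quiet system is far less likely to expose it. A quick way to see whether a machine is in the exposed configuration (device name is a placeholder):<p><pre><code> # Filesystems mounted with inline discard (TRIM during normal I/O)
findmnt -O discard

# Is NCQ active on the link? A queue depth > 1 implies it is.
cat /sys/block/sda/device/queue_depth</code></pre>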
I'm so sick of this TRIM. Constant configuration needed because of it, constant care like "things you'd better not do on SSDs". And then problems like this.<p>Do you think there'll ever be SSDs that don't need it?
I have one of the affected drives mentioned in the article in my development laptop - the Samsung SSD 850 PRO 512GB.<p>It's one of the most expensive SSDs on the market, so it was disconcerting to find dmesg -T showing TRIM errors when the drive was mounted with the discard option. Research on mailing lists indicated that the driver devs believe it's a Samsung firmware issue.<p>Disabling TRIM in fstab stopped the error messages. However, it's difficult to get good information about whether drive performance or longevity may be impacted without TRIM support.
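For reference, the change amounts to dropping the discard option and, optionally, trimming on a schedule instead (UUID, mount point, and options here are placeholders):<p><pre><code> # /etc/fstab -- before: inline discard issues TRIM during normal I/O
# UUID=xxxx-xxxx  /  ext4  defaults,discard  0  1
# after: no inline discard
UUID=xxxx-xxxx  /  ext4  defaults  0  1

# then trim periodically, e.g. weekly from cron:
fstrim -v /</code></pre>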
Interesting! I sometimes work with SSDs as storage media for cameras (where Sandisk is the most popular brand by a mile) and I seriously doubt any camera firmware is doing drive maintenance. From what I know of digital imaging technicians, neither are they - if a drive starts acting up in any way, the usual policy is to just take it out of service immediately, recover anything that was on it, dump it, and buy a replacement.
How do you disable TRIM on common distros? Under Ubuntu, is it just preventing /etc/cron.weekly/fstrim from running, or is there more to it? What about CentOS, etc?
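There are two separate knobs: the periodic fstrim job and any inline discard mount options in /etc/fstab. A sketch for Ubuntu (paths are my best guess; other distros may not ship a trim job at all):<p><pre><code> # Stop the weekly job; run-parts skips non-executable scripts.
sudo chmod -x /etc/cron.weekly/fstrim

# Find filesystems mounted with inline discard, then delete the
# 'discard' option from the matching /etc/fstab entries and remount.
findmnt -O discard</code></pre>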
What SSDs do cloud hosters like DigitalOcean, Linode, Rackspace, Vultr, etc. use?<p>I would assume some sites trade storage speed for more space (HDDs instead of SSDs).
Undoubtedly the same issue happened to me on a 500GB 840 EVO with NTFS.<p>The SSD zeroed out part of the disk at runtime, and I watched it happen: music was playing from this drive, mounted under Ubuntu MATE 15.04 and played through Audacious. Suddenly the music glitched and I/O errors began appearing. Rebooted to a DISK READ ERROR (the MBR was on the EVO). Ran chkdsk from USB and it showed a ridiculous number of orphaned files for ca. 1 hour. Once it finished, the <i>most frequently accessed</i> files had disappeared: the Downloads folder, the Documents folder, some system files. Of course, some of the files could have been recovered had I not run chkdsk off the bat, but nonetheless it's an approximate measure of the failure's impact.<p>I first became suspicious of the 840 EVO when sorting old files by date became fantastically slow. If you have a feeling this has happened to you recently - buckle up for a shitstorm.<p>TL;DR: Avoid the 840 EVO.