Originally TRIM was an un-queued command: all writes had to be flushed, then TRIM executed, then writes could continue. This was bad for performance with automatic on-file-delete trim, so everyone wanted a TRIM command that could be put in the command queue along with writes. Many newer drives support this.<p>It turns out that Samsung 8XX SSDs advertise support for queued TRIM, but it's buggy. The old un-queued TRIM command works fine.<p><a href="https://lkml.org/lkml/2015/6/10/642" rel="nofollow">https://lkml.org/lkml/2015/6/10/642</a><p>There are in fact lots of "quirks lists" and "blacklists" in the kernel, and virtually all computers require some workarounds in the Linux kernel for some buggy hardware they have. Pretty amazing when you think about it.<p>EDIT: another closely related example is MacBook Pro SSDs and NCQ, aka native command queuing. They claim they support it, but on many it's buggy. It gets better though; the Linux kernel only started trying to use such functionality by default relatively recently.<p><a href="https://bugzilla.kernel.org/show_bug.cgi?id=60731" rel="nofollow">https://bugzilla.kernel.org/show_bug.cgi?id=60731</a><p>These sorts of things are, as you can see, very confusing and frustrating to track down, identify, and find a general fix for.<p>EDIT2: now that I've actually read the kernel bugzilla entry further, it's more recently come to light that the actual problem with recent MacBook Pro SSDs is MSI (message-signaled interrupts, a more efficient interrupt mechanism)
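If you're curious whether your own machine is in the affected configuration, a rough sketch (the device name is a placeholder, and the exact dmesg wording varies by kernel version):<p><pre><code> # Does the kernel expose discard (TRIM) for this device at all?
# Non-zero DISC-GRAN / DISC-MAX values mean discards will be issued.
lsblk --discard /dev/sda

# Failed queued TRIM tends to show up in the kernel log as failed
# "SEND FPDMA QUEUED" commands (the NCQ opcode that carries DSM TRIM).
dmesg | grep -iE 'trim|fpdma|ncq'</code></pre>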
Nice debugging story. When I was at NetApp there were lots of times when drive firmware for the 'less used' options would fail. On the Fibre Channel drives, the 'write zeros' command, which was supposed to zero a drive, was notorious for its inability to achieve something that simple. When Google looked at disk encryption technology (I don't know if they ever deployed it), it worked differently from disk to disk and from firmware rev to firmware rev. I think it was Brian Pawlowski at NetApp who said "You can count on two things working right in a hard drive: read, write, and seek." The joke being that you needed all three of them to work for reliable disk operation.
Here's an Ubuntu bug tracker entry for what sounds like the same problem: <a href="https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/1449005" rel="nofollow">https://bugs.launchpad.net/ubuntu/+source/fstrim/+bug/144900...</a><p>Linux 4.0.5 includes a patch that blacklists queued TRIM for the buggy drives. Windows and OS X apparently don't support queued TRIM at all, so they're unaffected.
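The workaround lives in libata's quirks table; if you have a kernel tree handy you can check whether it covers your drive (the flag name and path are from the upstream tree, so treat this as a sketch):<p><pre><code> # Blacklisted models get the NO_NCQ_TRIM flag, which keeps plain TRIM
# working but stops the kernel from sending the queued variant.
grep -n 'NO_NCQ_TRIM' drivers/ata/libata-core.c</code></pre>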
To me, this sort of thing brings home the value of not running your own machines. Sure, Amazon's/Google's clouds have quirks, but it's far less likely that you're going to have to debug faulty hardware in this way. It sounds like a team of more than one person worked on this at least part-time for weeks -- how much is that worth? It's not just the cost of hiring extra people to do the work; often small companies simply can't hire enough good people -- when you do find them, do you want to squander them twiddling servers?
Not directly related to TRIM, but Aerospike has a nice test suite for SSDs, probing for IOPS and latency: <a href="https://github.com/aerospike/act" rel="nofollow">https://github.com/aerospike/act</a><p>They share their test results for both physical and cloud-based storage; I figured this would be of interest:<p><a href="http://www.aerospike.com/docs/operations/plan/ssd/ssd_certification.html" rel="nofollow">http://www.aerospike.com/docs/operations/plan/ssd/ssd_certif...</a>
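From memory of the README, a run looks roughly like this (tool names, paths, and the config file are approximate and may differ between versions; the device is a placeholder):<p><pre><code> # Salt the device first so the SSD is in a realistic steady state,
# then run the workload and post-process the latency histogram.
sudo ./act_prep /dev/sdb
sudo ./act actconfig.txt > act_out.txt
./latency_calc/act_latency.py -l act_out.txt</code></pre>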
It feels like Samsung used the Linux community here as a free testbed.<p>Samsung knew that only Linux supported queued trim, so releasing it without proper testing is just externalizing the disproportionately increased cost of testing to the Linux community.
Strange, the Samsung 840/850 EVO/Pro are considered [1][2] among the best consumer SSDs. The issues the article mentions do not exist on Windows; the SSDs are very reliable there. I suspect it's not only Samsung's fault. Are we sure Linux's handling of TRIM operations is absolutely correct?<p>[1] <a href="http://techreport.com/review/27062/the-ssd-endurance-experiment-only-two-remain-after-1-5pb" rel="nofollow">http://techreport.com/review/27062/the-ssd-endurance-experim...</a><p>[2] <a href="http://www.anandtech.com/show/8216/samsung-ssd-850-pro-128gb-256gb-1tb-review-enter-the-3d-era/13" rel="nofollow">http://www.anandtech.com/show/8216/samsung-ssd-850-pro-128gb...</a>
I have this running on my Ubuntu ThinkPad with a Samsung 840 Pro as a weekly cron job. Should I turn it off?<p><pre><code> #!/bin/sh
# call fstrim-all to trim all mounted file systems which support it
set -e
# This only runs on Intel and Samsung SSDs by default, as some SSDs with faulty
# firmware may encounter data loss problems when running fstrim under high I/O
# load (e. g. https://launchpad.net/bugs/1259829). You can append the
# --no-model-check option here to disable the vendor check and run fstrim on
# all SSD drives.
exec fstrim-all</code></pre>
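Worth noting: on a new enough kernel (4.0.5+, per the blacklist patch mentioned elsewhere in this thread) the affected Samsung models are excluded from queued TRIM, so a job like this falls back to the old un-queued command and should be safe. You can run it by hand to see what it actually does:<p><pre><code> # Verbose one-off run; prints how many bytes were trimmed per filesystem.
sudo fstrim -v /</code></pre>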
Pretty disappointing to see some of those Samsung drives on the list, because in some of the other tests/surveys I've seen they seemed to be among the better choices. <i>Sigh</i> I guess Sturgeon's Law applies to SSDs too.
"Samsung SSD 850 PRO 512GB
recently blacklisted as 850 Pro and later in 8-series blacklist"<p>That's what I have in my home computer, with Arch Linux.<p>Do you think this problem is particular to the servers of the article's author, or should it be interpreted as:<p>Linux + Samsung 850 = you will lose your data?<p>Thanks...
Using SAS SSDs in a server is a bad idea for many reasons. One should use PCIe cards that sit directly on the PCIe bus, such as FusionIO or SanDisk. They have been tested and retested (e.g. by Facebook), without the unnecessary added complexity of the SAS/SATA protocols. The I/O performance is also about 20x.
Been there, done that. :|<p>Sometime around the end of 2013 I started getting frequent data loss and corrupted filesystems upon reboot.
After much searching and about 4-6 months into the issue, I found out that the culprit was the queued TRIM commands issued by the Linux kernel to my Crucial M500 mSATA disk. The Linux kernel already had a quirks list with many drives, including some of the M500 variants, just not mine.<p>I added my model, compiled the kernel, and the nightmare ended. I proceeded to submit a bug report and a patch. The patch got accepted (yay!) and the bug report turned out to be very useful for other people with the same problem but a different disk, as I included the dmesg output specific to the issue. This meant they could now google the errors and get a helpful result.<p>Such is the nature of free software; you are allowed to fix your computer yourself. :)
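For anyone wanting to do the same, step one is getting the exact model string the kernel sees, since the blacklist entries in drivers/ata/libata-core.c are matched against it (device name is a placeholder; entries may use glob patterns and can also match on firmware revision):<p><pre><code> # Model string as the kernel sees it
cat /sys/block/sda/device/model

# Or, with smartmontools, which also shows the firmware revision
smartctl -i /dev/sda</code></pre>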
I've worked on some interesting SSD deployments and experiments over the past 12 months. Quite honestly, I wouldn't go anywhere near Samsung products, regardless of their 'PRO' labelling or otherwise.<p>We have had great success with both SanDisk Extreme Pro SATA and Intel DC NVMe series drives. We've also recently deployed a number of Crucial 'Micron' M600 1TB SATA drives that are performing very well and so far haven't given us any issues.
I've had issues with these Samsung 8xx drives; unfortunately they all happened at once. I gave up on their RMA/warranty process because I was bounced back and forth between the same two numbers a few times. Each side said that the other was in charge of the process (Samsung bought the SSD division from Seagate... or was it Seagate that bought the HDD division from Samsung? To this day I have no clue.).
I have a Samsung SSD 850 PRO 512GB in my Windows PC, and I have TRIM enabled in Windows:<p><pre><code> > fsutil.exe behavior query DisableDeleteNotify
DisableDeleteNotify = 0
</code></pre>
Should I be worried?
Can someone clarify the article's claim that these Samsung drives are really "broken" as such? We have a few of these on 3.13 and 3.16 kernels and ext4 with no problems. It seems that there must be something unique to their application in order to expose these trim failures.
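One plausible explanation (my guess, not from the article): the failures seem to need queued TRIM issued concurrently with other I/O, which mostly happens with the discard mount option under load; a periodic fstrim on a quiet system is far less likely to expose it. A quick way to see whether a machine is in the exposed configuration (device name is a placeholder):<p><pre><code> # Filesystems mounted with inline discard (TRIM during normal I/O)
findmnt -O discard

# Is NCQ active on the link? A queue depth > 1 implies it is.
cat /sys/block/sda/device/queue_depth</code></pre>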
I'm so sick of this TRIM. Constant configuration needed because of it, constant care like "things you'd better not do on SSDs". And then problems like this.<p>Do you think there'll ever be SSDs that don't need it?
I have one of the affected drives mentioned in the article in my development laptop - the Samsung SSD 850 PRO 512GB.<p>It's one of the most expensive SSDs on the market, so it was disconcerting to find dmesg -T showing TRIM errors when the drive was mounted with the discard option. Research on mailing lists indicated that the driver devs believe it's a Samsung firmware issue.<p>Disabling TRIM in fstab stopped the error messages. However, it's difficult to get good information about whether drive performance or longevity may be impacted without TRIM support.
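For reference, the change amounts to dropping the discard option and, optionally, trimming on a schedule instead (UUID, mount point, and options here are placeholders):<p><pre><code> # /etc/fstab -- before: inline discard issues TRIM during normal I/O
# UUID=xxxx-xxxx  /  ext4  defaults,discard  0  1
# after: no inline discard
UUID=xxxx-xxxx  /  ext4  defaults  0  1

# then trim periodically, e.g. weekly from cron:
fstrim -v /</code></pre>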
Interesting! I sometimes work with SSDs as storage media for cameras (where Sandisk is the most popular brand by a mile) and I seriously doubt any camera firmware is doing drive maintenance. From what I know of digital imaging technicians, neither are they - if a drive starts acting up in any way, the usual policy is to just take it out of service immediately, recover anything that was on it, dump it, and buy a replacement.
How do you disable TRIM on common distros? Under Ubuntu, is it just preventing /etc/cron.weekly/fstrim from running, or is there more to it? What about CentOS, etc?
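There are two separate knobs: the periodic fstrim job and any inline discard mount options in /etc/fstab. A sketch for Ubuntu (paths are my best guess; other distros may not ship a trim job at all):<p><pre><code> # Stop the weekly job; run-parts skips non-executable scripts.
sudo chmod -x /etc/cron.weekly/fstrim

# Find filesystems mounted with inline discard, then delete the
# 'discard' option from the matching /etc/fstab entries and remount.
findmnt -O discard</code></pre>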
What SSDs do cloud hosters like DigitalOcean, Linode, Rackspace, Vultr, etc. use?<p>I would assume some sites trade storage speed for more space (HDDs instead of SSDs).
Undoubtedly the same issue happened to me on a 500GB 840 EVO with NTFS.<p>The SSD zeroed out part of the disk at runtime, and I watched it happen: music was playing from this drive, mounted under Ubuntu MATE 15.04 and played through Audacious. Suddenly the music glitched and I/O errors began appearing. Rebooted to a DISK READ ERROR (the MBR was on the EVO). Ran chkdsk from USB and it showed a ridiculous number of orphaned files for ca. 1 hour. Once it finished, the <i>most frequently accessed</i> files had disappeared: the Downloads folder, the Documents folder, some system files. Of course, some of the files could have been recovered had I not run chkdsk off the bat, but nonetheless it's an approximate measure of the failure's impact.<p>I first became suspicious of the 840 EVO when sorting old files by date became fantastically slow. If you have a feeling this has happened to you recently - buckle up for a shitstorm.<p>TL;DR: Avoid the 840 EVO.