It really bugs me (and has for a while) that there is still no mainstream Linux filesystem that supports data block checksumming. Silent corruption is not exactly new, and the odds of running into it have grown significantly as drives have gotten bigger. It's a bit maddening that nobody seems to care (or maybe I'm just looking in the wrong places).<p>(...sure, you could call ZFS or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by Red Hat". ZFS isn't, and RH considers btrfs to still be "experimental".)
Oh, yes. Silent bit errors are tons of fun to track down.<p>I spent a day chasing what turned out to be a bad bit in the cache of a disk drive; bits would get set to zero in random sectors, but always at a specific sector offset. The drive firmware didn't bother doing any kind of memory test; even a simple stuck-at test would have found this and preserved the customer's data.<p>In another case, we had Merkle-tree integrity checking in a file system, to prevent attackers from tampering with data. The unasked-for feature was that it was a memory test, too, and we found a bunch of systems with bad RAM. ECC would have made this a non-issue, but this was consumer-level hardware with very small cost margins.<p>It's fun (well maybe "fun" isn't the right word) to watch the different ways that large populations of systems fail. Crash reports from 50M machines will shake your trust in anything more powerful than a pocket calculator.
ZFS is also crazy good at surviving disks with bad sectors (as long as they still respond fast). Check out this paper: <a href="https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf" rel="nofollow">https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...</a><p>They even spread the metadata across the disk by default. I'm running on some old WD Greens with 1500+ bad sectors and it's cruising along with RAIDZ just fine.<p>There is also failmode=continue, where ZFS doesn't hang when it can't read something. If you have a distributed layer above ZFS that also checksums (like HDFS), you can go pretty far even without RAID and with quite broken disks. There is also copies=n. When ZFS broke, the disk usually stopped talking or died a few days later. btrfs and ext4 just choke and remount read-only quite fast (probably the best and correct course of action), but you can tell ZFS to just carry on! Great piece of engineering!
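For anyone curious what those knobs look like in practice, a minimal sketch (the pool and dataset names here are just placeholders):<p><pre><code> # keep the pool responding instead of hanging when a device stops answering
 zpool set failmode=continue tank

 # store two copies of every block in this dataset, even on a single-vdev pool
 zfs set copies=2 tank/important
 </code></pre>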
It's articles like this that reinforce my disappointment that Apple is choosing NOT to implement checksums in their new file system, APFS.<p><a href="https://news.ycombinator.com/item?id=11934457" rel="nofollow">https://news.ycombinator.com/item?id=11934457</a>
"Data tends to corrupt. Absolute data tends to corrupt absolutely."<p>In both sense of the word.<p>Many moons ago, in one of my first professional assignments, I was tasked with what was, for the organisation, myself, and the provisioned equipment, a stupidly large data processing task. One of the problems encountered was a failure of a critical hard drive -- this on a system with no concept of a filesystem integrity check (think a particularly culpable damned operating system, and yes, I said that everything about this was stupid). The process of both tracking down, and then demonstrating convincingly to management (I <i>said</i> ...) the nature of the problem was infuriating.<p>And that was with hardware which was reliably and replicably bad. Transient data corruption ... because cosmic rays ... gets to be one of those particularly annoying failure modes.<p>Yes, checksums and redundancy, please.
If I were to run ZFS on my laptop with a single disk and copies=1, and a file becomes corrupted, can I recover it (partially)?<p>My assumption is that the read will fail and the error will be logged, but since there is no redundancy it will stay unreadable.<p>Will ZFS attempt to read the file again, in case the error is transient? If not, can I make ZFS retry reading? Can I "unlock" the file and read it even though it is corrupted, or get a copy of the file? If I restore the file from backup, can ZFS make sure the backup is good using the checksum it expects the file to have?<p>Single-disk users seem to be unusual, so it's not obvious how to do this; all the documentation assumes a highly available installation rather than a laptop. But I think there's value in ZFS even with a single disk - if only I understood exactly how it fails and how to scavenge for pieces when it does.
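Not a complete answer, but a rough single-disk workflow sketch (the pool and dataset names are placeholders; on a default single-disk pool only the metadata gets extra copies, so the damaged data blocks themselves stay unreadable and a restore from backup simply writes new blocks with new checksums):<p><pre><code> # list the files with permanent (unrecoverable) errors
 zpool status -v rpool

 # clear the error counters and re-check, in case the error was transient
 zpool clear rpool
 zpool scrub rpool

 # after restoring the affected files from backup, another scrub verifies
 # the freshly written data against its (new) checksums
 zpool scrub rpool

 # for future writes, keep two copies of every block even on one disk
 zfs set copies=2 rpool/home
 </code></pre>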
The exact same silent data corruption issues just happened to my 6 x 5TB ZFS FreeBSD fileserver. But unlike what the poster concluded, mine were caused by bad (ECC!) RAM. I kept meticulous notes, so here is my story...<p>I scrub on a weekly basis. One day ZFS started reporting silent errors on disk ada3, just 4kB:<p><pre><code> pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada3    ONLINE       0     0     2  <---
            ada4    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
</code></pre>
I monitored the situation. But every week, subsequent scrubs continued to find errors on ada3, on varying amounts of data (84kB to 5.2MB):<p><pre><code> 2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)
</code></pre>
The next week, the server became unreachable during a scrub. I attempted to access the console over IPMI, but it just showed a blank screen and was unresponsive. I rebooted it.<p>The next week, the server again became unreachable during a scrub. This time I could access the console over IPMI, but the network was not working even though the link was up. I checked the IPMI event logs and saw multiple <i>correctable</i> memory ECC errors:<p><pre><code> Correctable Memory ECC @ DIMM1A(CPU1) - Asserted
</code></pre>
The kernel logs reported multiple Machine Check Architecture errors:<p><pre><code> MCA: Bank 4, Status 0xdc00400080080813
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Source RD Memory
MCA: Address 0x5462930
MCA: Misc 0xe00c0f2b01000000
</code></pre>
At this point I could not even remotely reboot the server via IPMI. I also theorized that in addition to the <i>correctable</i> memory ECC errors, the DIMM may have been experiencing <i>uncorrectable/undetected</i> ones that were messing up not just the OS but also the IPMI controller. So I physically removed the module in "DIMM1A", and the server has been working perfectly well since then.<p>The reason these memory errors always showed up on ada3 is not a bad drive or bad cables, but most likely the way FreeBSD allocates buffer memory to cache drive data: the data for ada3 probably sat right on the defective physical memory page(s), and the kernel never moves that data around. So it's always ada3's data that seems corrupted.<p>PS: the really nice combinatorial property of raidz2 with 6 drives is that when silent corruption occurs, the kernel has 15 different ways to attempt to rebuild the data ("6 choose 4 = 15").
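Tangential, but since weekly scrubs came up: on FreeBSD the stock periodic(8) machinery can drive them. A minimal sketch for /etc/periodic.conf, assuming the bundled 800.scrub-zfs script and using this thread's pool name as a placeholder (the threshold is in days, so 7 gives roughly weekly scrubs):<p><pre><code> daily_scrub_zfs_enable="YES"
 daily_scrub_zfs_pools="tank"
 # re-scrub once 7 days have elapsed since the last scrub
 daily_scrub_zfs_default_threshold="7"
 </code></pre>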
I know for sure that btrfs scrub found 8 correctable errors on my home server's filesystem last July. That was obviously great news for me. Contrary to a lot of people here, I've personally found btrfs to be really stable (as long as you don't use raid5/6).
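For anyone who hasn't run one, a minimal btrfs scrub sketch (the mount point is just an example):<p><pre><code> # start a scrub of the mounted filesystem (runs in the background)
 btrfs scrub start /srv/data

 # show progress plus counts of corrected and uncorrectable errors
 btrfs scrub status /srv/data

 # cumulative per-device error counters
 btrfs device stats /srv/data
 </code></pre>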
People grossly under-intuit the channel error rate of SATA. At datacenter scale it's alarmingly high: <a href="http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-sata-table2.jpg" rel="nofollow">http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-s...</a>
I'm not a database expert, but this seems like something I should worry about, at least a bit. Is this a problem if you store all your persistent data in a database like MySQL?
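Partially. InnoDB, for example, keeps a checksum on each page and treats a mismatch as fatal rather than serving garbage, but that only catches corruption when the page is actually read, and it can't repair anything on its own. A rough way to force a full read-through check (assuming a stock MySQL install; the .ibd path is just an example):<p><pre><code> # read every table and verify it (runs CHECK TABLE under the hood)
 mysqlcheck --all-databases --check

 # offline verification of a single InnoDB tablespace file
 innochecksum /var/lib/mysql/mydb/mytable.ibd
 </code></pre>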
Of course, my ZFS NAS backup is sound until a file that got bitrotted on my non-ZFS computer is touched and then backed up to it :/<p>It's kind of (literally?) like immutability. If you allow even a little mutability, it ruins it.<p>I think all filesystems should be able to add error-correction data to ensure data integrity.
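You can bolt that kind of error-correction data on above the filesystem today with parity archives. A small sketch using par2cmdline (file names are placeholders; -r10 asks for roughly 10% redundancy):<p><pre><code> # create parity data able to repair up to ~10% damage
 par2 create -r10 photos.par2 *.jpg

 # later: detect bitrot, and repair it from the parity blocks
 par2 verify photos.par2
 par2 repair photos.par2
 </code></pre>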
The story here is not that silent data corruption is real.
The story is that somebody did a bad home-brew server build and fucked up.<p>So ZFS protects against end-user mistakes.<p>I was really hoping for a large-scale study on silent data corruption, but no, just an anecdote.<p>Sad!<p>:D
Interesting find! I wonder what would be a good safeguard against this. I feel like just backing up your data would offer something - but a file could silently become corrupted and then get backed up in its corrupted form too.
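One cheap safeguard is to record checksums while the data is known-good and verify them before every backup run, so silent changes never propagate. A sketch with GNU coreutils (the paths are placeholders):<p><pre><code> # record checksums once, while the data is known to be good
 find /data -type f -print0 | xargs -0 sha256sum > ~/manifest.sha256

 # before each backup: report any file whose contents have silently changed
 sha256sum --quiet --check ~/manifest.sha256
 </code></pre>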
Yes, it is VERY real - because no one gives a damn. Most people (consumers) would just ignore that corrupted JPEG.<p>I am in the minority that gets very frustrated and paranoid when my videos or photos get corrupted.<p>Synology has btrfs on some range of their NAS boxes, but most of them are expensive.<p>I really want a consumer NAS, or preferably even a Time Capsule (with two 2.5" HDDs instead of one drive), with built-in ZFS and ECC memory, that scrubs the drives weekly by default and alerts you when there is a problem.<p>And lastly, do any of the consumer cloud storage services - OneDrive, Dropbox, Amazon, iCloud - have these protections in place? Because I would much rather data corruption be someone else's problem than add complexity on my end.
Last month I started a really simple and effective project to heal bitrot on Linux (macOS/Unix?). It's "almost done"; it just needs more real-world testing and a systemd service. I've been pretty busy the last few weeks, so I've only been able to improve its performance.<p><a href="https://github.com/liloman/heal-bitrots" rel="nofollow">https://github.com/liloman/heal-bitrots</a><p>Unfortunately, btrfs is not stable, and ZFS needs a "supercomputer", or at least as many GB of ECC RAM as you can buy. This solution is designed for any machine and any FS.