It really bugs me (and has for a while) that there is still no mainstream Linux filesystem that supports data block checksumming. Silent corruption is not exactly new, and the odds of running into it have grown significantly as drives have gotten bigger. It's a bit maddening that nobody seems to care (or maybe I'm just looking in the wrong places).<p>(...sure, you could call ZFS or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by Red Hat". ZFS isn't, and RH considers btrfs to still be "experimental".)
Oh, yes. Silent bit errors are tons of fun to track down.<p>I spent a day chasing what turned out to be a bad bit in the cache of a disk drive; bits would get set to zero in random sectors, but always at a specific sector offset. The drive firmware didn't bother doing any kind of memory test; even a simple stuck-at test would have found this and preserved the customer's data.<p>In another case, we had Merkle-tree integrity checking in a file system, to prevent attackers from tampering with data. The unasked-for feature was that it was a memory test, too, and we found a bunch of systems with bad RAM. ECC would have made this a non-issue, but this was consumer-level hardware with very small cost margins.<p>It's fun (well maybe "fun" isn't the right word) to watch the different ways that large populations of systems fail. Crash reports from 50M machines will shake your trust in anything more powerful than a pocket calculator.
ZFS is also crazy good at surviving disks with bad sectors (as long as they still respond fast). Check out this paper: <a href="https://research.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf" rel="nofollow">https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...</a><p>They even spread the metadata across the disk by default. I'm running on some old WD Greens with 1500+ bad sectors and it's cruising along with RAIDZ just fine.<p>There is also failmode=continue, where ZFS doesn't hang when it can't read something. If you have a distributed layer above ZFS that also checksums (like HDFS), you can go pretty far even without RAID and with quite broken disks. There is also copies=n. When ZFS broke, the disk usually stopped talking or died a few days later. btrfs and ext4 just choke and remount read-only quite fast (probably the best and correct course of action), but you can tell ZFS to just carry on! Great piece of engineering!
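For anyone curious what those knobs look like in practice, a minimal sketch (the pool and dataset names here are just placeholders):<p><pre><code> # keep the pool responding instead of hanging when a device stops answering
 zpool set failmode=continue tank

 # store two copies of every block in this dataset, even on a single-vdev pool
 zfs set copies=2 tank/important
 </code></pre>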
It's articles like this that reinforce my disappointment that Apple is choosing NOT to implement checksums in their new file system, APFS.<p><a href="https://news.ycombinator.com/item?id=11934457" rel="nofollow">https://news.ycombinator.com/item?id=11934457</a>
"Data tends to corrupt. Absolute data tends to corrupt absolutely."<p>In both sense of the word.<p>Many moons ago, in one of my first professional assignments, I was tasked with what was, for the organisation, myself, and the provisioned equipment, a stupidly large data processing task. One of the problems encountered was a failure of a critical hard drive -- this on a system with no concept of a filesystem integrity check (think a particularly culpable damned operating system, and yes, I said that everything about this was stupid). The process of both tracking down, and then demonstrating convincingly to management (I <i>said</i> ...) the nature of the problem was infuriating.<p>And that was with hardware which was reliably and replicably bad. Transient data corruption ... because cosmic rays ... gets to be one of those particularly annoying failure modes.<p>Yes, checksums and redundancy, please.
If I were to run ZFS on my laptop with a single disk and copies=1, and a file becomes corrupted, can I recover it (partially)?<p>My assumption is that the read will fail and the error will be logged, but since there is no redundancy it will stay unreadable.<p>Will ZFS attempt to read the file again, in case the error is transient? If not, can I make ZFS retry reading? Can I "unlock" the file and read it even though it is corrupted, or get a copy of the file? If I restore the file from backup, can ZFS make sure the backup is good using the checksum it expects the file to have?<p>Single-disk users seem to be unusual, so it's not obvious how to do this; all the documentation assumes a highly available installation rather than a laptop. But I think there's value in ZFS even with a single disk - if only I understood exactly how it fails and how to scavenge for pieces when it does.
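Not a complete answer, but a rough single-disk workflow sketch (the pool and dataset names are placeholders; on a default single-disk pool only the metadata gets extra copies, so the damaged data blocks themselves stay unreadable and a restore from backup simply writes new blocks with new checksums):<p><pre><code> # list the files with permanent (unrecoverable) errors
 zpool status -v rpool

 # clear the error counters and re-check, in case the error was transient
 zpool clear rpool
 zpool scrub rpool

 # after restoring the affected files from backup, another scrub verifies
 # the freshly written data against its (new) checksums
 zpool scrub rpool

 # for future writes, keep two copies of every block even on one disk
 zfs set copies=2 rpool/home
 </code></pre>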
The exact same silent data corruption issues just happened to my 6 x 5TB ZFS FreeBSD fileserver. But unlike what the poster concluded, mine were caused by bad (ECC!) RAM. I kept meticulous notes, so here is my story...<p>I scrub on a weekly basis. One day ZFS started reporting silent errors on disk ada3, just 4kB:<p><pre><code> pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada3    ONLINE       0     0     2  <---
            ada4    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
</code></pre>
I monitored the situation. But every week, subsequent scrubs continued to find errors on ada3, on varying amounts of data (84kB to 5.2MB):<p><pre><code> 2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)
</code></pre>
The next week, the server became unreachable during a scrub. I attempted to access the console over IPMI, but it just showed a blank screen and was unresponsive. I rebooted it.<p>The next week, the server again became unreachable during a scrub. This time I could access the console over IPMI, but the network was not working even though the link was up. I checked the IPMI event logs and saw multiple <i>correctable</i> memory ECC errors:<p><pre><code> Correctable Memory ECC @ DIMM1A(CPU1) - Asserted
</code></pre>
The kernel logs reported multiple Machine Check Architecture errors:<p><pre><code> MCA: Bank 4, Status 0xdc00400080080813
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Source RD Memory
MCA: Address 0x5462930
MCA: Misc 0xe00c0f2b01000000
</code></pre>
At this point I could not even remotely reboot the server via IPMI. I also theorized that in addition to the <i>correctable</i> memory ECC errors, the DIMM may have been experiencing <i>uncorrectable/undetected</i> ones that were messing up not just the OS but also the IPMI controller. So I physically removed the module in "DIMM1A", and the server has been working perfectly well since then.<p>The reason these memory errors always showed up on ada3 is not a bad drive or bad cables, but most likely the way FreeBSD allocates buffer memory to cache drive data: the data for ada3 probably sat right on the defective physical memory page(s), and the kernel never moves that data around. So it's always ada3's data that seems corrupted.<p>PS: the really nice combinatorial property of raidz2 with 6 drives is that when silent corruption occurs, the kernel has 15 different ways to attempt to rebuild the data ("6 choose 4 = 15").
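Tangential, but since weekly scrubs came up: on FreeBSD the stock periodic(8) machinery can drive them. A minimal sketch for /etc/periodic.conf, assuming the bundled 800.scrub-zfs script and using this thread's pool name as a placeholder (the threshold is in days, so 7 gives roughly weekly scrubs):<p><pre><code> daily_scrub_zfs_enable="YES"
 daily_scrub_zfs_pools="tank"
 # re-scrub once 7 days have elapsed since the last scrub
 daily_scrub_zfs_default_threshold="7"
 </code></pre>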
I know for sure that btrfs scrub found 8 correctable errors on my home server's filesystem last July. That was obviously great news for me. Contrary to a lot of people here, I've personally found btrfs to be really stable (as long as you don't use raid5/6).
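For anyone who hasn't run one, a minimal btrfs scrub sketch (the mount point is just an example):<p><pre><code> # start a scrub of the mounted filesystem (runs in the background)
 btrfs scrub start /srv/data

 # show progress plus counts of corrected and uncorrectable errors
 btrfs scrub status /srv/data

 # cumulative per-device error counters
 btrfs device stats /srv/data
 </code></pre>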
People grossly under-intuit the channel error rate of SATA. At datacenter scale it's alarmingly high: <a href="http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-sata-table2.jpg" rel="nofollow">http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-s...</a>
I'm not a database expert, but this seems like something I should worry about, at least a bit. Is this a problem if you store all your persistent data in a database like MySQL?
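Partially. InnoDB, for example, keeps a checksum on each page and treats a mismatch as fatal rather than serving garbage, but that only catches corruption when the page is actually read, and it can't repair anything on its own. A rough way to force a full read-through check (assuming a stock MySQL install; the .ibd path is just an example):<p><pre><code> # read every table and verify it (runs CHECK TABLE under the hood)
 mysqlcheck --all-databases --check

 # offline verification of a single InnoDB tablespace file
 innochecksum /var/lib/mysql/mydb/mytable.ibd
 </code></pre>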
Of course, my ZFS NAS backup is sound until a file that got bitrotted on my non-ZFS computer is touched and then backed up to it :/<p>It's kind of (literally?) like immutability. If you allow even a little mutability, it ruins it.<p>I think all filesystems should be able to add error-correction data to ensure data integrity.
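You can bolt that kind of error-correction data on above the filesystem today with parity archives. A small sketch using par2cmdline (file names are placeholders; -r10 asks for roughly 10% redundancy):<p><pre><code> # create parity data able to repair up to ~10% damage
 par2 create -r10 photos.par2 *.jpg

 # later: detect bitrot, and repair it from the parity blocks
 par2 verify photos.par2
 par2 repair photos.par2
 </code></pre>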
The story here is not that silent data corruption is real.
The story is that somebody did a bad home-brew server build and fucked up.<p>So ZFS protects against end-user mistakes.<p>I was really hoping for a large-scale study on silent data corruption, but no, just an anecdote.<p>Sad!<p>:D
Interesting find! I wonder what would be a good safeguard against this. I feel like just backing up your data would offer something - but a file could silently become corrupted and then get backed up in its corrupted form too.
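One cheap safeguard is to record checksums while the data is known-good and verify them before every backup run, so silent changes never propagate. A sketch with GNU coreutils (the paths are placeholders):<p><pre><code> # record checksums once, while the data is known to be good
 find /data -type f -print0 | xargs -0 sha256sum > ~/manifest.sha256

 # before each backup: report any file whose contents have silently changed
 sha256sum --quiet --check ~/manifest.sha256
 </code></pre>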
Yes, it is VERY real - because no one gives a damn. Most people (consumers) would just ignore that corrupted JPEG.<p>I am in the minority that gets very frustrated and paranoid when my videos or photos get corrupted.<p>Synology has btrfs on some range of their NAS boxes, but most of them are expensive.<p>I really want a consumer NAS, or preferably even a Time Capsule (with two 2.5" HDDs instead of one drive), with built-in ZFS and ECC memory, that scrubs the drives weekly by default and alerts you when there is a problem.<p>And lastly, do any of the consumer cloud storage services - OneDrive, Dropbox, Amazon, iCloud - have these protections in place? Because I would much rather data corruption be someone else's problem than add complexity on my end.
Last month I started a really simple and effective project to heal bitrot on Linux (macOS/Unix?). It's "almost done"; it just needs more real-world testing and a systemd service. I've been pretty busy the last few weeks, so I've only been able to improve its performance.<p><a href="https://github.com/liloman/heal-bitrots" rel="nofollow">https://github.com/liloman/heal-bitrots</a><p>Unfortunately, btrfs is not stable, and ZFS needs a "supercomputer", or at least as many GB of ECC RAM as you can buy. This solution is designed for any machine and any FS.