The exact same silent data corruption issues just happened to my 6 x 5TB ZFS FreeBSD fileserver. But unlike what the poster concluded, mine were caused by bad (ECC!) RAM. I kept meticulous notes, so here is my story...<p>I scrub on a weekly basis. One day ZFS started reporting silent errors on disk ada3, just 4kB:<p><pre><code> pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
config:
        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada3    ONLINE       0     0     2  <---
            ada4    ONLINE       0     0     0
            ada6    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
</code></pre>
I monitored the situation. But every week, subsequent scrubs would continue to find errors on ada3, and on more data (100-5000kB):<p><pre><code> 2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)
</code></pre>
The next week, the server became unreachable during a scrub. I attempted to access the console over IPMI, but it just showed a blank screen and was unresponsive. I rebooted it.<p>The week after that, the server again became unreachable during a scrub. This time I could access the console over IPMI, but the network was not working even though the link was up. I checked the IPMI event logs and saw multiple <i>correctable</i> memory ECC errors:<p><pre><code> Correctable Memory ECC @ DIMM1A(CPU1) - Asserted
</code></pre>
The kernel logs reported multiple Machine Check Architecture errors:<p><pre><code> MCA: Bank 4, Status 0xdc00400080080813
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Source RD Memory
MCA: Address 0x5462930
MCA: Misc 0xe00c0f2b01000000
</code></pre>
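For anyone wanting to read the status word by hand, here is a minimal sketch that decodes its top flag bits. It assumes the usual x86 MCi_STATUS layout (bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC); the names `FLAGS` and `decode_flags` are made up for illustration, and the lower, model-specific error code bits are not interpreted.

```python
# Status word copied from the "MCA: Bank 4" log line above.
STATUS = 0xdc00400080080813

# Assumed layout of the architectural flag bits in MCi_STATUS.
FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten (overflow)
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # the Misc register is valid
    58: "ADDRV",  # the Address register is valid
    57: "PCC",    # processor context corrupt
}

def decode_flags(status):
    """Return the names of the flag bits that are set, highest bit first."""
    return [name for bit, name in FLAGS.items() if (status >> bit) & 1]

flags = decode_flags(STATUS)
print(flags)
print("corrected" if "UC" not in flags else "uncorrected")
```

With UC clear and OVER set, this agrees with the "COR OVER" in the kernel's own decoded line: corrected errors, arriving faster than they were being read out.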
At this point I could not even reboot the server remotely via IPMI. I theorized that, in addition to the <i>correctable</i> memory ECC errors, the DIMM was also producing <i>uncorrectable/undetected</i> ones that were disrupting not only the OS but also IPMI. So I physically removed the module in "DIMM1A", and the server has been working perfectly well since then.<p>These errors always showed up on ada3 not because of a bad drive or bad cables, but likely because of the way FreeBSD allocates buffer memory to cache drive data: the data for ada3 was probably located right on the defective physical memory page(s), and the kernel never moves that data around. So it is always ada3 data that appears corrupted.<p>PS: the really nice combinatorial property of raidz2 with 6 drives is that when silent corruption occurs, the kernel has 15 different ways to attempt to rebuild the data ("6 choose 4 = 15").
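<p>The "6 choose 4" count is easy to check. The sketch below only enumerates which 4-drive subsets could be used as the source for a rebuild; it is an illustration of the arithmetic, not of how ZFS actually implements reconstruction, and the names `DATA_COLUMNS` and `candidate_subsets` are hypothetical.

```python
from itertools import combinations
from math import comb

# The six raidz2 members from the zpool status output above.
drives = ["ada1", "ada2", "ada3", "ada4", "ada5", "ada6"]

# raidz2 tolerates two missing columns, so any 4 of the 6 suffice to rebuild.
DATA_COLUMNS = len(drives) - 2

def candidate_subsets(columns, k):
    """List every k-sized subset of columns that could feed a rebuild."""
    return list(combinations(columns, k))

subsets = candidate_subsets(drives, DATA_COLUMNS)
print(len(subsets), comb(len(drives), DATA_COLUMNS))

# Subsets that avoid the drive whose data kept coming back corrupted.
without_ada3 = [s for s in subsets if "ada3" not in s]
print(len(without_ada3))
```

Of the 15 subsets, 5 leave out ada3 entirely, so there are several independent ways to recompute the data without touching the corrupted copy.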