The article's conclusion is mostly false, at least for Linux: an uncorrectable error (UE) does not have to panic the machine in every case, and by default Linux simply kills the affected processes. Of course, if the error hits kernel memory you will get a panic, but the probability of that is low (roughly the amount of kernel memory divided by total memory). This was pointed out in the comments (not by me), but unfortunately the article has not been updated to reflect it. There is also no reason the policy cannot be handled in software: as long as the error is detected, the kernel is free to do what it wants with a UE, and everything is fine.<p>I have no proof that nothing crazy can ever happen, but there is no reason for single-bit errors not to be corrected regardless of the OS. The worst that should happen is that they go unreported.<p>IMO, if you have the opportunity (the category of hardware you want supports it), you would be crazy not to use ECC RAM. Non-ECC RAM is basically the only component in a PC that is not protected, an obvious weak point. I've been bitten at least twice (two defective components, and far more than two errors before I figured out what was happening), and that is only counting computers I <i>directly</i> owned or used at work (about a dozen in total). I don't want to lose my time anymore, so I always use ECC memory when possible. (I'm not going to pay twice the price for a computer just for that, so it is a "little" difficult with laptops, which also come with a plethora of other selection criteria, but it is very easy to get affordable desktop workstations with ECC.)<p>No modern digital communication bus would be designed without some form of protection, so it makes little sense to have computers without ECC RAM. I would even like to have it on smartphones, but unfortunately I doubt that will happen soon.
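To put a rough number on the "kernel memory / total memory" estimate above, here is a minimal sketch that works on text in the format of Linux's /proc/meminfo. The choice of Slab, KernelStack and PageTables as a stand-in for "kernel memory" is my own assumption and undercounts the real figure (kernel code and some other allocations are not included). The helper names and the sample values are made up for illustration.

```python
# Rough estimate of the share of RAM holding kernel data, from
# /proc/meminfo-style text. Assumption: Slab + KernelStack + PageTables
# is used as a crude proxy for "kernel memory"; it is a lower bound.

KERNEL_FIELDS = ("Slab", "KernelStack", "PageTables")

# Hypothetical sample in /proc/meminfo format (values in kB).
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         2000000 kB
Slab:             400000 kB
KernelStack:       16000 kB
PageTables:        64000 kB
"""


def parse_meminfo(text):
    """Return a dict mapping field name to its value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def kernel_fraction(fields):
    """Fraction of total RAM covered by the proxy kernel fields."""
    kernel_kb = sum(fields.get(name, 0) for name in KERNEL_FIELDS)
    return kernel_kb / fields["MemTotal"]


if __name__ == "__main__":
    share = kernel_fraction(parse_meminfo(SAMPLE))
    print(f"approx. kernel share of RAM: {share:.1%}")
```

On a real machine you would pass the contents of /proc/meminfo instead of SAMPLE. If errors land uniformly across physical memory, this fraction is roughly the chance that a given UE hits kernel data rather than a process.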
I would actually prefer that the uncorrected memory exception be handled by the operating system.<p>I would much prefer that the affected program(s) get a chance to react, or be killed as a subset of the system. If the error occurred in a filesystem context there may be other ways of correcting the issue (particularly if it is merely in the read cache rather than the write cache).<p>Obviously, unhandled exceptions should cascade until they are either contained or the entire system halts.
> Since we don't have our own particle accelerator to bombard the memory modules with in order to cause radiation-based errors<p>I really want to see someone get some radioisotopes and place them next to both ECC and non-ECC RAM (while forcing reads and writes to the affected memory) to see what sort of soft errors / SEUs happen.
It's a bit terrible that the author implies ZFS is more susceptible to bit errors <i>because</i> it scrubs data, and that any errors will make it go haywire. As opposed to other filesystems like NTFS/ext4, which presumably cope fine with undetected bit errors...