<i>Since that incident, I’ve had several other, similar problems. Something would start failing mysteriously, but flushing my cache restored it to normal.</i><p>This seems like a red flag that something else is actually going wrong with his computer.
To give you an idea of how often this occurs: my wife's CCD for her PhD experiments routinely (in roughly 1 in 5 exposures) picks up huge spikes from cosmic rays during her 30-second exposures. The CCD is less than an inch square and she's two floors below ground level.
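As a rough back-of-the-envelope check on those numbers, here is a minimal sketch that turns "1 in 5 exposures of 30 seconds" into an event rate. It assumes hits arrive as a Poisson process and treats the sensor as exactly one square inch (the comment only says "less than an inch square", so the per-area figure is a lower bound). The function and constant names are made up for illustration.

```python
import math

# Figures taken from the comment above.
HIT_FRACTION = 0.2   # roughly 1 in 5 exposures shows a spike
EXPOSURE_S = 30.0    # exposure length in seconds

# ASSUMPTION: the sensor is exactly 1 inch x 1 inch. The comment says it is
# smaller than that, so the real per-area rate would be higher.
SENSOR_AREA_CM2 = 2.54 ** 2


def hits_per_exposure(hit_fraction):
    """Mean hits per exposure, assuming Poisson arrivals.

    P(at least one hit) = 1 - exp(-lam), so lam = -ln(1 - P).
    """
    return -math.log(1.0 - hit_fraction)


def flux_per_cm2_per_s(hit_fraction, exposure_s, area_cm2):
    """Spike-producing events per square centimetre per second."""
    return hits_per_exposure(hit_fraction) / (exposure_s * area_cm2)


if __name__ == "__main__":
    lam = hits_per_exposure(HIT_FRACTION)
    flux = flux_per_cm2_per_s(HIT_FRACTION, EXPOSURE_S, SENSOR_AREA_CM2)
    print(f"mean hits per exposure: {lam:.3f}")
    print(f"events per cm^2 per second (lower bound): {flux:.5f}")
```

This only counts events large enough to show up as "huge spikes" on that particular CCD, so it says nothing directly about how often a DRAM cell would flip.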
Reading this, I remember how hard NASA works to harden their satellites and probes against cosmic rays, because out there in space, cosmic rays make your memory pretty unpredictable. Error-correcting codes and redundancy suddenly become really important, even though you are crammed into a little embedded system with less processing power than some input devices have these days.
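For anyone wondering what "error-correcting codes" means in practice, here is a minimal sketch of a Hamming(7,4) code, which stores 4 data bits in 7 bits and can repair any single flipped bit. This is a textbook illustration only; it is not a claim about what NASA hardware or any particular ECC RAM actually uses, and the function names are made up for the example.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if none found.

    Corrects at most one flipped bit. Two or more flips are miscorrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of the flipped bit (or 0 when all parity checks pass).
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single bit flip in storage
    data, pos = decode(word)
    print(data, pos)
```

The cost is visible in the layout: 3 extra bits for every 4 bits of data, plus the work of checking parity on every read, which is why it hurts on a tiny embedded system.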
I'm not saying that cosmic rays don't happen (they absolutely do; the question is whether they can cause memory corruption that actually makes a difference in a running system), but this is quite strange. No such faults were happening before this single incident, and now there are many similar faults happening regularly? Why should I suspect cosmic rays (was there any reason for such a sudden change in their activity and its visibility?) and not a hardware fault?
These kinds of memory errors are more often caused by alpha particles emitted by radioactive elements in the chip package: <a href="http://en.wikipedia.org/wiki/Soft_error" rel="nofollow">http://en.wikipedia.org/wiki/Soft_error</a>
For those who want to know more about cosmic rays, Wikipedia is filled with goodness on the subject. (<a href="http://en.wikipedia.org/wiki/Cosmic_ray" rel="nofollow">http://en.wikipedia.org/wiki/Cosmic_ray</a>) I was looking for stats on average density per m² to determine just how prevalent this effect might be in ground-based electronics. It's been a major problem with high-altitude and satellite equipment for a long, long time.
From my experience, I think it is unlikely to be due to cosmic rays. The most likely culprit is the power supply or the data buffers. Those non-tantalum capacitors tend to reach end of life faster if you're operating in high-humidity conditions.<p>This reminded me of a number of random crashes that a client of my previous company had. Stack dumps just showed random errors. We had about a year's worth of crash logs from a couple thousand network switches (they were an ISP). We initially suggested that this might be a problem with cosmic rays. We even compared the frequency of the random crashes against sunspot cycles. No relationship found. It turned out to be another component failing because of a design error.
Great work digging into this issue. A memory test is probably in order.<p>I learned about ECC RAM when I was trying to figure out why some server lease deals were so inexpensive relative to others. For instance, the last I checked, hetzner.de's hardware does not support ECC RAM. I am of course not calling out Hetzner, and there are other factors in such deals.