This reminds me of Netflix's Chaos Monkey[1]. I'm becoming increasingly convinced that a system that has been exposed to random faults during development and maybe in production as well is the only way to go. It forces one to automate recovery from most failure states, and alert an engineer only when absolutely necessary.

[1] http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html