There is a bunch of good advice here, but it's missed the most useful principal in my experience, probably because the motivating example is too small in scope:<p><i>The way to build reliable software systems is to have multiple independent paths to success.</i><p>This is the Erlang "let it crash" strategy restated, but I've also found it embodied in things like the architecture of Google Search, Tandem Computer, Ethereum, RAID 5, the Space Shuttle, etc. Basically, you achieve reliability through redundancy. For any given task, compute the answer multiple times in parallel, ideally in multiple independent ways. If the answer agrees, great, you're done. If not, have some consensus mechanism to detect the true answer. If you can't compute the answer in parallel, or you still don't get one back, retry.<p>The reason for this is simply math. If you have n different events that must all go right to achieve success, the chance of this happening is x1 * x2 * ... * xn. This product goes to zero very quickly - if you have 20 components connected in series that are all 98% reliable, the chance of success is only 2/3. If instead you have n different events where <i>any</i> one can go right to achieve success, the chance of success is 1 - (1 - y1) * (1 - y2) * ... * (1 - yn). <i>This</i> inverse actually increases as the number of alternate pathways to success goes up and fast. If you have 3 alternatives each of which has just an 80% chance of success, but any of the 3 will work, then doing them all in parallel has a 97% chance of success.<p>This is why complex software systems that must stay up are built with redundancy, replicas, failover, retries, and other similar mechanisms in place. And the presence of those mechanisms usually trumps anything you can do to increase the reliability of individual components, simply because you get diminishing returns to carefulness. You might spend 100x more resources to go from 90% reliability to 99% reliability, but if you can identify a system boundary and correctness check, you can get that 99% reliability simply by having 2 teams each build a subsystem that is 90% reliable and checking that their answers agree.