I think this is another good example of how we as an industry are still unable to assess risk adequately.<p>I'm fairly certain that the higher-ups at Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.<p>Just as with the recent outages at Heroku and EC2, and the financial crisis of 2008, which was laughably called a "16-sigma event", it seems clear that the actual assessment of risk is pretty poor. The way Heroku failed, where invalid data in a stream caused the failure, and the way EC2 failed, where a single misconfigured device caused widespread failure, show that the entire area of risk management is still in its infancy. My employer went down globally for an entire day because of an electrical grid problem: the diesel generators didn't fail over properly because of a misconfiguration.<p>You would think that after decades there would be better analysis and higher-quality "best practices", but the field still appears rather immature. Is this because risk assessment at a company is left to people who don't understand risk? And is there an opportunity here for "consultants" who do understand it, much like security consultants?