I think this is another good example of how we as an industry are still unable to assess risk adequately.<p>I'm fairly certain that the higher-ups at Twitter weren't told "We have pretty good failover protection, but there is a small risk of catastrophic failure where everything will go completely down." Whoever was in charge of disaster recovery obviously didn't really understand the risk.<p>As with the recent Heroku and EC2 outages, and the financial crisis of 2008, which was laughably called a "16-sigma event", it seems pretty clear that the actual assessment of risk is poor. The way that Heroku failed, where invalid data in a stream caused failure, and the way that EC2 failed, where a single misconfigured device caused widespread failure, show that the entire area of risk management is still in its infancy. My employer went down globally for an entire day after an electrical grid problem, because the diesel generators didn't fail over properly due to a misconfiguration.<p>You would think that after decades there would be better analysis and higher-quality "best practices", but the field still appears to be rather immature. Is this because risk assessment at a company is left to people who don't understand risk? And is there an opportunity here for "consultants" who do understand it, much like security consultants?
I think that a lot of you guys are confusing "Disaster Recovery" with "Business Continuity".<p>Disaster Recovery is a reactive approach. It's what you do to get things back up AFTER a system or site has failed.<p>Business Continuity is a proactive approach. It's what you do to ensure that your critical services will remain viable whenever disaster occurs.<p>In the cases of Heroku, Amazon, Twitter, and many more, their Disaster Recovery strategies have been successful. The fact that they came back online without major data loss is proof of that. Their business continuity strategies, however, have been found wanting.
I hope they write up a post-mortem on the fallout (hopefully it won't be a post-mortem of Twitter). Those things are always extremely interesting with big infrastructure like this.
Always fun when you're developing against an API, and then have to perform a frantic investigation to work out if your latest code change broke <i>everything</i>... or it's just the API endpoint itself.
"Today's turbulence explained" was just posted: <a href="http://blog.twitter.com/2012/06/todays-turbulence-explained.html" rel="nofollow">http://blog.twitter.com/2012/06/todays-turbulence-explained....</a><p>Unfortunately there are no details, it just says "there was a cascading bug in one of our infrastructure components".
Twitter Status - <a href="http://status.twitter.com/" rel="nofollow">http://status.twitter.com/</a><p>No news at the status site either, which defeats the purpose of having a dedicated status site.
I'm glad this made it to the front page. Is the topic itself newsworthy? Not on its own. Is all the discussion that's flooding into this thread worth having?<p>Yep. Even the subthread from the person complaining that this isn't newsworthy.
this still applies: <a href="http://blog.pinboard.in/2011/12/don_t_be_a_free_user/" rel="nofollow">http://blog.pinboard.in/2011/12/don_t_be_a_free_user/</a>
If you're debugging web services that have suddenly slowed down (timeouts of 10s), this outage may be the cause if they depend on s.twitter.com, search.twitter.com or api.twitter.com.<p>As a workaround for those systems, add entries to your /etc/hosts file that map s.twitter.com, search.twitter.com and api.twitter.com to 127.0.0.1.<p>This obviously breaks Twitter integration, but it also keeps page loads from stalling while they wait for remote resources.
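For code you control, a possible alternative to editing /etc/hosts is to put a short timeout on the remote call and treat any failure as "no data". Below is a minimal Python sketch using the standard library's urllib. The function name fetch_or_none, the 2-second default and the opener parameter are illustrative assumptions, not part of any Twitter client library.

```python
import urllib.request


def fetch_or_none(url, timeout=2.0, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the remote host is slow or down.

    `opener` is a hypothetical injection point so the call can be faked in tests;
    by default it is the real urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # URLError, HTTPError, timeouts and refused connections are all OSError
        # subclasses; treat them alike so the page can render without the data.
        return None
```

The caller then has to handle None, for example by rendering the page without the Twitter widget, rather than blocking for the full default timeout.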
the mobile site is still up. <a href="http://m.twitter.com/" rel="nofollow">http://m.twitter.com/</a> You can tweet and everything. The streaming API is also still partly up.
People are going to have to resort to desperate measures: <a href="http://nedroid.com/2009/05/people-have-to-know/" rel="nofollow">http://nedroid.com/2009/05/people-have-to-know/</a>
Why is this on the HN front page? This is an entirely worthless post. It adds no value. Nobody is going to reread this at any point in the future. Utterly worthless.