These kinds of domino effects are one reason why scalability is so hard to get right. It reminds me of precipitation in supersaturated solutions. Everything seems normal until you reach some unforeseen tipping point, and then all hell breaks loose.<p>I like his little veiled pitch for Google's services when he talks about how easy it was to bring more request routers online given their elastic architecture. It makes me wonder why that elasticity isn't automated -- more routers should <i>automatically</i> be brought online if any routers hit their maximum load.
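Something like this would be enough as a first pass, I'd think. Purely a sketch with made-up calls (get_router_loads, provision_router) -- no idea what Google's internals actually look like:<p><pre><code>  import time

  MAX_LOAD = 0.8        # fraction of capacity at which we add a router
  CHECK_INTERVAL = 30   # seconds between load checks

  def autoscale(get_router_loads, provision_router):
      # Poll router load and bring another router online
      # whenever any one of them is near its limit.
      while True:
          loads = get_router_loads()   # e.g. {"router-1": 0.42, ...}
          if any(load >= MAX_LOAD for load in loads.values()):
              provision_router()       # spin up one more request router
          time.sleep(CHECK_INTERVAL)
</code></pre>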
Wow, I was impressed by how closely this mea culpa mirrors Amazon's from their big S3 outage.<p>Compare to:<p><a href="http://developer.amazonwebservices.com/connect/message.jspa?messageID=79978#79978" rel="nofollow">http://developer.amazonwebservices.com/connect/message.jspa?...</a>
I admire the transparency, but I don't believe for a second it's the whole story. This happened during work hours, and if they really were notified that quickly, I'm wondering why it took over 90 minutes to recover.<p>Also, the outage, for me anyway, seemed to last much longer than the stated 100 minutes. I seem to remember being unable to access GMail for a span of about 3 hours today.
It's nice to see them being so transparent about what happened and how they plan on fixing it in the future. They're obviously working on anticipating future problems, but what I wonder about are cases like this one, where they thought they were covered. How does one go about finding these failure points in systems that span multiple locations? I hope they follow up with lessons learned on their quest to improve reliability.
This is probably the most glaring flaw in SaaS and cloud computing. Even the giants go down eventually. Couple that with your own ISP's outages and your potential downtime roughly doubles, since the two failure modes stack.
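Back-of-the-envelope with made-up but typical numbers: if the service and your ISP each independently manage 99.9% uptime, the availabilities multiply, so the downtimes roughly add:<p><pre><code>  saas_uptime = 0.999   # ~8.8 hours of downtime per year
  isp_uptime = 0.999    # assume your ISP does about as well
  combined = saas_uptime * isp_uptime   # independent failures multiply uptime
  print("combined uptime: %.4f%%" % (combined * 100))   # ~99.80% -- about double the downtime
</code></pre>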