Maybe I'm oversimplifying things, but why haven't these companies distributed their compute resources across various facilities and cloud providers, enabled instant failover, and tested this before outages like these?
It's a testament to how successful Amazon has been with its cloud offering. We're used to sites going down for one reason or another. What's weird is that Amazon's success has made all these failures so correlated. It's a strange feeling when many sites you like all fail at once.
And here I was thinking the change I just rolled out to our EC2 instances had boned our test environment. Two failures in a week? Does not really inspire confidence right now :(