Our instances in ap-southeast-2 were out for around 12 hours. We used multiple availability zones and it didn't prevent downtime at all. The difference between the AWS and Google outage responses is very interesting. AWS is down for 12+ hours for some customers, forces each customer to chase service level credits, and signs off the postmortem with a nameless and faceless "-The AWS Team". Not one person at AWS was willing to take responsibility for this failure.<p>Google, by contrast, was recently down for less than 18 minutes. A VP at Google sent an email advising all affected customers, posted continuous updates to their status page, sent a further apology email at the conclusion, posted a service credit exceeding the SLA to all customers in the zone (without forcing customers to chase this themselves with billing) and lastly wrote one of the best-written postmortems I've ever seen. AWS has much to learn from Google about how to handle outages properly.
Why, oh why do they report times in PDT rather than AEST (the zone of the affected area) or UTC (the standard everything else is based on)?<p>(Mutter, mutter, … something about Americans and their timezones … and northern hemispherians and their seasons …)
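For anyone who wants to translate the report's PDT timestamps, here is a minimal Python sketch using fixed offsets (PDT = UTC-7, AEST = UTC+10). The function name `convert_pdt` and the sample timestamp are assumptions made for illustration, not values taken from the report.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the two zones named in the comment above.
PDT = timezone(timedelta(hours=-7), "PDT")
AEST = timezone(timedelta(hours=10), "AEST")


def convert_pdt(naive_pdt: datetime) -> tuple[datetime, datetime]:
    """Interpret a naive datetime as PDT and return it as (UTC, AEST)."""
    aware = naive_pdt.replace(tzinfo=PDT)
    return aware.astimezone(timezone.utc), aware.astimezone(AEST)


if __name__ == "__main__":
    # Hypothetical timestamp, chosen only to show the conversion.
    utc, aest = convert_pdt(datetime(2016, 6, 4, 22, 25))
    print(utc.isoformat())
    print(aest.isoformat())
```

Fixed offsets are enough for a one-off report. For dates that might cross a daylight-saving change, the standard-library `zoneinfo` module with `America/Los_Angeles` and `Australia/Sydney` would be the more general option.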
For those wondering what the "severe weather" was:<p>* <a href="http://www.smh.com.au/national/australias-wild-weather-sydneys-massive-storm-in-pictures-20160606-gpcyu7.html" rel="nofollow">http://www.smh.com.au/national/australias-wild-weather-sydne...</a><p>* <a href="http://www.abc.net.au/news/2016-06-07/sydney-weather-storm-damaged-beachfront-homes-likely-dismantled/7487056" rel="nofollow">http://www.abc.net.au/news/2016-06-07/sydney-weather-storm-d...</a><p>* <a href="http://www.sbs.com.au/news/gallery/pictures-wild-weather-savages-nsw-and-tasmania" rel="nofollow">http://www.sbs.com.au/news/gallery/pictures-wild-weather-sav...</a>
Heh - I love the image in my head of the flywheel providing a few extra seconds of power to the coffee urn in the Blackwoods warehouse out the back and to all the fan heaters and big screen TVs in Toongabbie - just as Foxtel, Dominos, and Channel 9's Nagios dashboards all start turning red and their ops staff phones start beeping.
>> The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage)<p>It is false to assume that the state of the electrical supply is either on or off. This may come as a surprise to some, but not to me. In 2008, Eskom (South Africa's electricity supplier) experienced similar faults. The mains supply voltage is 220 V here. At one point, some devices in my house started to fail, while others, such as lights, continued to work but were significantly dimmer. We measured 180 V at the plugs. There were similar outages in my area last year, where an outright cut-off was preceded by voltage drops. This outage is interesting because it is an example of a bug owing to false assumptions!<p>There have also been incidents where certain cables were stolen [1], and that caused the opposite: voltage spikes.<p>[1] I couldn't tell you which, or what kind, but I remember it has something to do with "the neutral"
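The point about supply state not being binary can be sketched in a few lines of Python. This is only an illustration: the names `classify_supply` and `naive_is_on`, and the 90% and 10% thresholds, are assumptions made up for the example, not anything from the AWS report or from real protection equipment.

```python
from enum import Enum


class Supply(Enum):
    NORMAL = "normal"
    SAG = "sag"
    OUTAGE = "outage"


def classify_supply(measured_v: float, nominal_v: float,
                    sag_below: float = 0.90,
                    outage_below: float = 0.10) -> Supply:
    """Classify a voltage reading into three states instead of on/off.

    The threshold fractions are assumed values for illustration only.
    """
    ratio = measured_v / nominal_v
    if ratio < outage_below:
        return Supply.OUTAGE
    if ratio < sag_below:
        return Supply.SAG
    return Supply.NORMAL


def naive_is_on(measured_v: float) -> bool:
    """The binary assumption: any voltage at all counts as 'on'."""
    return measured_v > 0


if __name__ == "__main__":
    # 180 V measured on a 220 V supply, the figures from the comment above.
    print(naive_is_on(180.0))
    print(classify_supply(180.0, 220.0))
```

With the binary check, a 180 V reading looks like a healthy supply. The three-state version flags it as a sag, which is the case the quoted sentence describes.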
I love reading about problems like these; it's great that Amazon is forthcoming about them. There's always some new wrinkle.<p>E.g. in this case, in normal operation, power from the utility power grid spins a flywheel. When the grid fails, the flywheel provides a holdover until Amazon's diesel generators can start.<p>But in this failure the voltage from the grid sagged rather than going away completely. The breaker isolating the flywheel from the grid didn't open quickly enough, so power from the flywheel was sent out to the grid. It didn't succeed in powering the grid for very long. Oops.
I'm a bit dubious about their "if you used multi-AZ you'll be fine" when I had multiple outages in a multi-AZ Elastic Beanstalk application of over an hour. Methinks the load balancers aren't as magical as they'd like to make out.
I knew that this was a big event when it happened last Sunday, because the AWS service status page had a yellow triangle rather than a green tick. Usually when they have an outage, they just put a tiny blue 'i' on the green tick...
There is something Orwellian about referring to this as a 'service event'.<p>I am reminded of 'The Event' from That Mitchell and Webb Look [0]. We don't talk about The Event.<p>[0] <a href="https://www.youtube.com/watch?v=wnd1jKcfBRE" rel="nofollow">https://www.youtube.com/watch?v=wnd1jKcfBRE</a>