
AWS Post-Mortem

145 points by zeit_geist, almost 14 years ago

9 comments

rdl, almost 14 years ago
The thing I wonder about is wtf they didn't manually switch to generator when their automatic controls failed. They had presumably ~5 minutes of UPS; it took them 40 minutes to do this. This probably isn't directly Amazon's fault, but whatever contract datacenter they are using in Europe (probably a PTT, or possibly an international carrier; really curious what facility).

I'm wary of using >1 generators to back up loads, thus requiring sync on generators for backup anyway -- much more comfortable with splitting the load up by room and having one generator per, with some kind of switch to allow for pulling generators out for maintenance. This pretty much limits you to 2-3MW per room (the largest economical diesel gensets), but that's not horrible.

Really high-reliability sites actually run onsite generation as PRIMARY (since generation is less reliable to start than to keep running), with utility as backup. With the right onsite generation equipment, it can be cheaper/more efficient than the grid, too (by using combined cycle; use heat output to run cooling directly).

Still, the 365 Main power outages take the cake; they used rotational UPSes (generators with huge flywheels) which had software bugs such that if input power got turned off and on several times (a common utility failure mode), the unit shut itself off entirely. Doh.
Comment #2881865 not loaded
marcamillion, almost 14 years ago
It seems to me that Amazon Web Services will never truly be VERY stable.

Not because I am being cynical, but just based on the nature of what they are doing.

They are the biggest provider of large-scale cloud-based computing services. They are pushing the boundaries. They are bound to keep running into problems that no one has ever seen before (including themselves), just based on the very nature of their business.

So if you are looking for 'rock-solid reliability', maybe it is better to wait for another big company (Google, Apple, etc.) to come behind and fix all the mistakes that Amazon made the first time.

That being said, I use AWS and I love it. Granted, I don't use EBS (not directly; via Heroku), and while I have encountered downtime recently, it's not that big of a deal. I know they aren't messing around, and they are in uncharted territory.

I can't reasonably expect them to have the best uptime for a platform that no one has ever built before, on the first time around the block. That's very unreasonable.

That being said, I will continue using them until I outgrow them or the economics become painful, because the value I get from paying only for what I use far outweighs 24-48 hours of downtime per year.
o1iver, almost 14 years ago
There seems to be a pretty simple solution to these problems: diversification. Like most things in life, putting all your eggs in one basket is not the right choice.

The people who use only AWS, or only RackSpace, or only 1&1 are equally wrong.

What you have to do is diversify. Run a ghost of your production site on some other platform (software/hardware bugs, ...), run by some other provider (bankruptcy, theft, ...), in another country (power cuts, earthquakes, ...). As soon as the primary goes down, you switch on the secondary. The probability of a total blackout is then squared: 10^-3 * 10^-3 = 10^-6.

The great thing with these "cloud" platforms is that your secondary system can even "go to sleep", saving you money, and then spin up instances as soon as the primary goes down. This is, by the way, how banks, airport systems, and probably the NSA do it!
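The squared-probability arithmetic above is easy to make concrete. A minimal sketch, assuming the comment's figure of a 10^-3 outage probability per provider and fully independent failures:

```python
# Probability that EVERY provider is down at once, assuming
# outages are independent (the comment's simplifying assumption).
def combined_outage_probability(probabilities):
    result = 1.0
    for p in probabilities:
        result *= p
    return result

# Two providers, each down 0.1% of the time:
print(combined_outage_probability([1e-3, 1e-3]))
```

In practice provider outages are not fully independent (shared upstream carriers, correlated traffic surges), so this product is a lower bound on the real blackout probability.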
larrycatinspace, almost 14 years ago
I'm thinking AWS needs to implement the Availability Zones AZ-ChaosMonkey and AZ-ChaosApe: a dedicated playground for breaking things, where they can start to observe how this complex system reacts to simple failures and gaps in assumptions.
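The chaos-zone idea can be sketched as a toy simulation. Everything here (the `Instance` class, `chaos_round`) is illustrative, not a real AWS API:

```python
import random

class Instance:
    """Toy stand-in for a compute instance in the chaos zone."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def terminate(self):
        self.alive = False

def chaos_round(fleet, kill_fraction=0.2, rng=random):
    """Terminate a random subset of the fleet and return the victims,
    so you can then check that the survivors still serve traffic."""
    count = max(1, int(len(fleet) * kill_fraction))
    victims = rng.sample(fleet, count)
    for inst in victims:
        inst.terminate()
    return victims

fleet = [Instance(f"i-{n:04d}") for n in range(10)]
victims = chaos_round(fleet, kill_fraction=0.2, rng=random.Random(0))
print(len(victims), "instances terminated")
```

The point of the dedicated zone is exactly what the comment says: the interesting observations come not from the kills themselves but from how the rest of the system reacts to them.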
Comment #2881028 not loaded
Comment #2881873 not loaded
mtkd, almost 14 years ago
It's a good communication from Amazon - maybe a little too long - could use a summary block at top.

The compensation looks generous too.
Comment #2882729 not loaded
gfodor, almost 14 years ago
For all those complaining about AWS, I think it's important not to fall into the trap of throwing all of Amazon's services into the same bucket. EBS (and hence RDS) has shown time and time again that it is the most complex offering and the most prone to failure.

Generally speaking, at least for now, the parts of your system built on top of EBS should be carefully architected to survive in the face of erratic EBS latency, data corruption, or even downtime. (All of which are part of the standard AWS contract, but happen much more often in practice than if you are used to the mean failure time of a hard disk sitting in a cage.)

This pattern leads me to believe that services such as VoltDB that do not directly rely upon attached storage will prove to be the paradigm necessary to get reliable cloud computing, at least in the AWS ecosystem. On-demand provisioning of disk is an extraordinarily hard problem, and a world where local ephemeral storage provides durability through redundancy across nodes and AZs is probably where we are headed.
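The "architect to survive erratic EBS latency" advice usually boils down to bounding every storage call with retries and backoff. A minimal sketch, where `read_fn` is a hypothetical stand-in for whatever volume read your application performs:

```python
import time

def read_with_retries(read_fn, attempts=3, base_delay=0.1):
    """Call read_fn(), retrying on IOError with exponential backoff.
    After the final attempt, the error propagates to the caller."""
    for attempt in range(attempts):
        try:
            return read_fn()
        except IOError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

A real deployment would also cap the total time spent retrying, so that a persistently sick volume triggers failover to a replica rather than stalling requests indefinitely.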
jwatte, almost 14 years ago
I thought best practice backup power was to use large flywheels for re-generation, and spin up diesel engines to power the wheel in the event of a loss. That way, there is no phase synchronization issue, just a mechanical clutch. Seems like this outage could have been prevented with better gear?
robryan, almost 14 years ago
This seems to share a lot of parallels with the last big outage, in terms of the API request overload and the EBS replication. It seems like the system needs to be better able to tell the difference between a single node going down and requiring a remirror, and most of an availability zone going down.
saturn, almost 14 years ago
As someone who has put a considerable amount of resources into moving things into cloud computing - I wanted to believe. But I have changed my mind.

Cloud computing scales the efficiencies, yes. It also scales the problems. And because of this, AWS is *by several orders of magnitude* the worst of my current hosts.

I have dedicated servers. No downtime in the past year. I have a couple of cloud servers with Rackspace. No downtime (although I don't recommend them). I have some VPSes with local providers. No downtime.

AWS? *More than 24 hours of downtime in the last year.* Seriously, for someone trying to run web sites reliably - screw that. I'm not using AWS any more.

And don't even get me started on the apologists. "EBS slow as treacle? Well, you should have been running a multi-zone RAID-20 redundant array! Duh!" "EC2 instances dying at random? Well, you should architect and implement a multi-master failover intelligent grid!"

I used to be under some kind of crazy delusional spell that the above was correct, and that it was somehow my fault that I wasn't correctly adapting to AWS's numerous failings. Well, no more. Now I realise that I should just stick with the super-reliable service I know and love from traditional operators. You need to programmatically grow and shrink your app-server flock? Great, use AWS. For the other 99.999% of us - stick with what you were using before.
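For scale, the downtime figures quoted in this thread translate into availability percentages as follows; a quick back-of-envelope check, assuming a 365-day year:

```python
def availability(downtime_hours_per_year):
    """Fraction of the year a service is up, given its annual downtime."""
    hours_per_year = 365 * 24
    return 1 - downtime_hours_per_year / hours_per_year

# 24 h/year comes out around 99.73%; 48 h/year around 99.45%.
print(f"{availability(24):.4%}")
print(f"{availability(48):.4%}")
```

By comparison, "four nines" (99.99%) allows under an hour of downtime per year, which puts the 24-48 hour figures cited above in perspective.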
Comment #2882119 not loaded
Comment #2881533 not loaded
Comment #2881276 not loaded
Comment #2881219 not loaded
Comment #2881173 not loaded
Comment #2883187 not loaded