<i>Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem.
Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all times. Our support, data, and other engineering teams also worked around the clock.</i><p>The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.<p>I've experienced it firsthand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested in seeing whether any other technology companies/backend teams are using it.<p><a href="http://en.wikipedia.org/wiki/Incident_command_system" rel="nofollow">http://en.wikipedia.org/wiki/Incident_command_system</a>
Kudos to Heroku for taking full responsibility, and for planning to engineer around these kinds of Amazon problems in the future.<p>In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:<p><i>3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.</i><p>Combined with multi-region support, this should make Heroku far more resilient in the future.
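For anyone unfamiliar with how this kind of continuous backup usually works: PostgreSQL's built-in continuous archiving hands each completed WAL segment to a user-supplied archive_command, and replaying those segments on top of a base backup gives automated, point-in-time recovery. Heroku hasn't published its implementation, so the following is only a minimal sketch of that general approach (the script path and archive directory are invented):<p><pre><code>#!/usr/bin/env python
# Minimal sketch of a WAL archive hook -- not Heroku's actual mechanism.
# postgresql.conf would reference it roughly like this:
#   archive_mode    = on
#   archive_command = '/usr/local/bin/archive_wal.py "%p" "%f"'
# where %p is the path of the completed WAL segment and %f is its file name.
import os
import shutil
import sys

ARCHIVE_DIR = "/mnt/wal-archive"  # hypothetical off-instance archive location

def main():
    wal_path, wal_name = sys.argv[1], sys.argv[2]
    dest = os.path.join(ARCHIVE_DIR, wal_name)
    if os.path.exists(dest):
        sys.exit(1)  # never overwrite a previously archived segment
    shutil.copy2(wal_path, dest + ".tmp")
    os.rename(dest + ".tmp", dest)  # atomic rename: readers never see a partial file
    sys.exit(0)  # a zero exit code tells PostgreSQL the segment is safely archived

if __name__ == "__main__":
    main()
</code></pre>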
I'm very impressed by how they take responsibility for this, in their own words: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS LAST WEEK."<p>It would be easy, tempting, and heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know Amazon shares the blame, yet are likely to appreciate Heroku's forthright acceptance of responsibility.<p>It's a good lesson. If I'm being totally honest, I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.
Thank you for taking full responsibility.<p>Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.
"Block storage is not a cloud-friendly technology".<p>Based on every post-mortem I've read thus far, it's clear how AWS and it's customers approach EBS will change.
What the hell? Why is everyone taking responsibility and giving Amazon a free ride? I'm a firm believer that only victims make excuses, and it's admirable to take responsibility, and maybe they should have had more redundancy in place, but given the way AWS has been advertised, most of us felt this kind of thing should never happen, even without a 100% uptime guarantee.<p>So, take 100% of the responsibility, but I wouldn't think any less of Heroku if they only took 50%.
This is why I love hosting on Heroku: they'll work their butts off to get it fixed when it's down, and I don't have to lift a finger.
However, EBS has long been known to be a turd; it's a pity they relied on it. Plus, if they had a way to bring it back up in a different region (e.g. the European AWS infrastructure) at the flick of a switch, that'd make me less nervous...
I don't think this is particularly 'honorable' or anything like that; it's the only sensible stance for them to take.<p>Let's be realistic about this: for most people using Heroku, the alternative would have been bare EC2, which could easily have suffered the same fate.<p>Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.<p>Even taking this downtime into account, Heroku is still cost-effective for me in a lot of cases.
><i>It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing</i><p>Heroku should save customers this pain by setting up anycast:<p><a href="https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domain_Name_System" rel="nofollow">https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domai...</a>
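To make the contrast concrete: the "geographic homing" in the quote means each customer's DNS has to be pointed at a region-specific router, so something has to map clients to the nearest region, whereas with anycast every region announces the same IP via BGP and the network makes that choice automatically. A rough, hypothetical sketch of the DNS-side mapping (region names and host names invented for illustration):<p><pre><code># Hypothetical sketch of DNS-level geographic homing -- the per-customer
# configuration that anycast would make unnecessary. Region names and
# endpoints below are invented for illustration.

REGIONAL_ENDPOINTS = {
    "us-east": "proxy.us-east.example.com",
    "us-west": "proxy.us-west.example.com",
    "eu-west": "proxy.eu-west.example.com",
}
DEFAULT_ENDPOINT = "proxy.us-east.example.com"

def endpoint_for(client_region):
    """Return the regional router a geo-aware DNS service would answer with."""
    return REGIONAL_ENDPOINTS.get(client_region, DEFAULT_ENDPOINT)

if __name__ == "__main__":
    # With anycast, all regions would instead announce one shared IP via BGP,
    # and routing itself would deliver each client to the nearest region.
    print(endpoint_for("eu-west"))  # -> proxy.eu-west.example.com
</code></pre>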
They gloss over their biggest failure: they weren't communicating or interacting with their customers <i>at all</i>.<p>* <a href="http://twitter.com/#!/heroku" rel="nofollow">http://twitter.com/#!/heroku</a>
* <a href="http://twitter.com/#!/herokustatus" rel="nofollow">http://twitter.com/#!/herokustatus</a>
I wish more companies (hell, people) were as forthright, pragmatic, and sensible as the Heroku gang. Their breakdown of the outage and their response to it are <i>exactly</i> what I, as a paying customer, want to hear.<p>Kudos!
And now reddit is down again (posting/submitting is impossible). Probably yet another Amazon issue.<p>In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's are even less so.