<i>Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem.
Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all times. Our support, data, and other engineering teams also worked around the clock.</i><p>The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.<p>I've experienced it firsthand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested in seeing whether any other technology companies/backend teams are using it.<p><a href="http://en.wikipedia.org/wiki/Incident_command_system" rel="nofollow">http://en.wikipedia.org/wiki/Incident_command_system</a>
Kudos to Heroku for taking full responsibility, and for planning to engineer around these kinds of Amazon problems in the future.<p>In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:<p><i>3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.</i><p>Combined with multi-region support, this should make Heroku far more resilient in the future.
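For anyone unfamiliar with how this kind of continuous backup usually works: PostgreSQL's built-in continuous archiving hands each completed WAL segment to a user-supplied archive_command, and replaying those segments on top of a base backup gives automated, point-in-time recovery. Heroku hasn't published its implementation, so the following is only a minimal sketch of that general approach (the script path and archive directory are invented):<p><pre><code>#!/usr/bin/env python
# Minimal sketch of a WAL archive hook -- not Heroku's actual mechanism.
# postgresql.conf would reference it roughly like this:
#   archive_mode    = on
#   archive_command = '/usr/local/bin/archive_wal.py "%p" "%f"'
# where %p is the path of the completed WAL segment and %f is its file name.
import os
import shutil
import sys

ARCHIVE_DIR = "/mnt/wal-archive"  # hypothetical off-instance archive location

def main():
    wal_path, wal_name = sys.argv[1], sys.argv[2]
    dest = os.path.join(ARCHIVE_DIR, wal_name)
    if os.path.exists(dest):
        sys.exit(1)  # never overwrite a previously archived segment
    shutil.copy2(wal_path, dest + ".tmp")
    os.rename(dest + ".tmp", dest)  # atomic rename: readers never see a partial file
    sys.exit(0)  # a zero exit code tells PostgreSQL the segment is safely archived

if __name__ == "__main__":
    main()
</code></pre>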
I'm very impressed by how they take responsibility for this, in their own words: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS LAST WEEK."<p>It would be easy, tempting, and heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know Amazon shares the blame, yet are likely to appreciate Heroku's forthright acceptance of responsibility.<p>It's a good lesson. If I'm being totally honest, I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.
Thank you for taking full responsibility.<p>Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.
"Block storage is not a cloud-friendly technology".<p>Based on every post-mortem I've read thus far, it's clear how AWS and it's customers approach EBS will change.
What the hell? Why is everyone taking responsibility and giving Amazon a free ride? I'm a firm believer that only victims make excuses, and it's admirable to take responsibility, and maybe they should have had more redundancy in place, but given the way AWS has been advertised, most of us felt this kind of thing should never happen, even without a 100% uptime guarantee.<p>So, take 100% of the responsibility, but I wouldn't think any less of Heroku if they only took 50%.
This is why I love hosting on Heroku: they'll work their butts off to get it fixed when it's down, and I don't have to lift a finger.
However, EBS has long been known to be a turd; it's a pity they relied on it. Plus, if they had a way to bring it back up in a different region (e.g. the European AWS infrastructure) at the flick of a switch, that'd make me less nervous...
I don't think this is particularly 'honorable' or anything like that; it's the only sensible stance for them to take.<p>Let's be realistic about this: for most people using Heroku, the alternative would have been bare EC2, which could easily have suffered the same fate.<p>Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.<p>Even taking this downtime into account, Heroku is still cost-effective for me in a lot of cases.
><i>It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing</i><p>Heroku should save customers this pain by setting up anycast:<p><a href="https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domain_Name_System" rel="nofollow">https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domai...</a>
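To make the contrast concrete: the "geographic homing" in the quote means each customer's DNS has to be pointed at a region-specific router, so something has to map clients to the nearest region, whereas with anycast every region announces the same IP via BGP and the network makes that choice automatically. A rough, hypothetical sketch of the DNS-side mapping (region names and host names invented for illustration):<p><pre><code># Hypothetical sketch of DNS-level geographic homing -- the per-customer
# configuration that anycast would make unnecessary. Region names and
# endpoints below are invented for illustration.

REGIONAL_ENDPOINTS = {
    "us-east": "proxy.us-east.example.com",
    "us-west": "proxy.us-west.example.com",
    "eu-west": "proxy.eu-west.example.com",
}
DEFAULT_ENDPOINT = "proxy.us-east.example.com"

def endpoint_for(client_region):
    """Return the regional router a geo-aware DNS service would answer with."""
    return REGIONAL_ENDPOINTS.get(client_region, DEFAULT_ENDPOINT)

if __name__ == "__main__":
    # With anycast, all regions would instead announce one shared IP via BGP,
    # and routing itself would deliver each client to the nearest region.
    print(endpoint_for("eu-west"))  # -> proxy.eu-west.example.com
</code></pre>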
They gloss over their biggest failure: they weren't communicating or interacting with their customers <i>at all</i>.<p>* <a href="http://twitter.com/#!/heroku" rel="nofollow">http://twitter.com/#!/heroku</a>
* <a href="http://twitter.com/#!/herokustatus" rel="nofollow">http://twitter.com/#!/herokustatus</a>
I wish more companies (hell, people) were as forthright, pragmatic, and sensible as the Heroku gang. Their breakdown of the outage and their response to it are <i>exactly</i> what I, as a paying customer, want to hear.<p>Kudos!
And now reddit is down again (posting/submitting is impossible). Probably yet another Amazon issue.<p>In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's are even less so.