Heroku's AWS outage post-mortem

202 points by mileszs, about 14 years ago

18 comments

chrishenn, about 14 years ago
"Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem. Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all time. Our support, data, and other engineering teams also worked around the clock."

The system they are using (IC, ops, engineer teams, operational periods) is extremely similar to the Incident Command System. The ICS was developed about 40 years ago for fighting wildfires, but now most government agencies use it to manage any type of incident.

I've experienced it first hand and can say it works very well, but I have never seen it used in this context. The great thing about it is its expandability: it will work for teams of nearly any size. I'd be interested in seeing if any other technology companies/backend teams are using it.

http://en.wikipedia.org/wiki/Incident_command_system

ekidd, about 14 years ago
Kudos to Heroku for taking full responsibility, and for planning to engineer around these kinds of Amazon problems in the future.

In particular, I'm delighted to hear that they plan to perform continuous backups on their shared databases:

"3) CONTINUOUS DATABASE BACKUPS FOR ALL. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases... We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks."

Combined with multi-region support, this should make Heroku far more resilient in the future.

adriand, about 14 years ago
I'm very impressed by how they take responsibility for this, in their words: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS LAST WEEK."

It would be easy, tempting and, heck, even reasonable to assign at least a portion of the blame to Amazon. Their approach is interesting because their customers already know that, but are likely to appreciate their forthright acceptance of responsibility.

It's a good lesson. If I'm being totally honest I'd have to admit that, as a developer, I sometimes blame external services or events for things that I have at least partial control over. Perhaps I should adopt Heroku's approach instead.

watchandwait, about 14 years ago
The AWS outage is definitely not over. Apparently RDS is built on EBS and they have not all been restored; I can tell you that first hand.

waxman, about 14 years ago
Thank you for taking full responsibility.

Everyone makes mistakes, so what matters is how you deal with them. This was the right way to respond. Thanks.

markbao, about 14 years ago
I wish Amazon was as good at communication and accountability as Heroku is.

chrisbaglieri, about 14 years ago
"Block storage is not a cloud-friendly technology".

Based on every post-mortem I've read thus far, it's clear that how AWS and its customers approach EBS will change.

bdb, about 14 years ago
Where is Amazon's?

greattypo, about 14 years ago
It's impressive that they're taking full responsibility, but I'm very surprised there's no mention of refunds.

dpcan, about 14 years ago
What the hell? Why is everyone taking responsibility and giving Amazon a free ride? I'm a firm believer that only victims make excuses, and it's admirable to take responsibility, and maybe they should have more redundancy in place, but the way AWS has been advertised, most of us felt this kind of thing should never happen even without a 100% uptime guarantee.

So, take 100% of the responsibility, but I wouldn't think any less of Heroku if they only took 50%.

chubs, about 14 years ago
This is why I love hosting on Heroku: they'll work their butt off to get it fixed when it's down, and I don't have to lift a finger. However, EBS has long been known to be a turd; it's a pity they relied on it. Plus, if they had a way to bring it back up in a different region (e.g. the euro AWS infrastructure) at the flick of a switch, that'd make me less nervous...

AffableSpatula, about 14 years ago
I don't think this is particularly 'honorable' or anything like that; it's the only sensible stance for them to take.

Let's be realistic about this: for most people using Heroku the alternative would have been bare EC2, which could easily have suffered the same fate as Heroku did.

Everyone should feel positive that they got to spend ~60 hours just sitting around moaning about being let down, instead of having to sweat their nuts off attempting to rehabilitate crazy, suicidal infrastructure.

Even taking this downtime into account, Heroku is still cost effective for me in a lot of cases.

metageek, about 14 years ago
> It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing

Heroku should save the customers this pain by setting up anycast:

https://secure.wikimedia.org/wikipedia/en/wiki/Anycast#Domain_Name_System

awicklander, about 14 years ago
They gloss over their biggest failure: they weren't communicating or interacting with their customers at all.

* http://twitter.com/#!/heroku
* http://twitter.com/#!/herokustatus

oomkiller, about 14 years ago
I'd really love to know some details on the continuous backup stuff. Sounds cool.

chrisbaglieri, about 14 years ago
I wish more companies (hell, people) were as forthright, pragmatic, and sensible as the Heroku gang. Their breakdown of and response to the outage are exactly what I, as a paying customer, want to hear.

Kudos!

mtw, about 14 years ago
What about also spreading across multiple providers (e.g. also using Rackspace Cloud)? They'd be less dependent on Amazon and its issues.

trezor, about 14 years ago
And now reddit is down again (posting/submitting is impossible). Probably yet another Amazon issue.

In all fairness, I've read that the reddit devs have made lots of boneheaded mistakes in their general infrastructure design, but it still seems Amazon is not a very reliable platform to build your stuff on. Platforms built on top of Amazon's are even less so.