Tuesday's Heroku outage post-mortem

69 points by bscofield, over 14 years ago

7 comments

erikpukinskis, over 14 years ago
I think it's fascinating that a single engineer, who they had on staff, was able to write a patch in one night that improved the performance of their messaging system 5x...

... *and hadn't already done it*.

This isn't meant as a slight against Heroku at all. They've got an incredible team of engineers. But imagine if Ricardo had said "hey, I could write a patch today that would speed up our messaging system 5x, should I do it?" The rest of the team would've said "OF COURSE!"

It reminds me of what happens to your brain when you launch a site. Even before you get feedback, somehow *the knowledge that people can use it* drastically changes your motivation system. Things that seemed important before are obviously not. Other things that were invisible before become the singular focus of your resources.

Maybe we should do fire drills...

* Your requests are suddenly taking 100x as long to complete. Go!

* Your "runway" disappears due to an accounting error and you have 7 days to turn a profit. Go!

* 50% of people visiting the site have no idea how to use it. Go!

How could we achieve the focus and clarity that a crisis brings on, without having the crisis?
danilocampos, over 14 years ago
I'd like to point out the lesson that other industries can learn from IT infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational to their clients' products. So when something goes down, there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust. When you fuck up, it will piss people off. That's just life – everyone makes mistakes. So you need to be the guy about whom people can say "Okay, there was a fuck up, it was bad, but look at how hard these guys worked to fix it. Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable. When you get medical malpractice, a financial meltdown or an oil spill going on, the cover-your-ass impulses are much more compelling.

Even in those cases though, I insist we need to encourage a culture where accountability and transparency are rewarded. Because, for me, accountable guys are the kind of people I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the Heroku guys are the sort of people I'd trust.
absconditus, over 14 years ago
Will someone at Heroku please describe your QA process?
smackfu, over 14 years ago
Are they going to remove "rock-solid" from their front page copy?
GICodeWarrior, over 14 years ago
Does Heroku use anything like 5 Whys to incrementally address organizational-type causes?
random42, over 14 years ago
*After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service.*

To me, it seems like they just needed to apply the "hot patch". Instead they panicked(?) and did a lot of unnecessary version control gymnastics, which delayed the bug fix.
dlevine, over 14 years ago
I think it's really cool that Heroku is so transparent about their outages. A lot of companies try to cover them up or blame them on someone else.

It's refreshing to see a company that not only acknowledges its outages, but even has a list of all past issues and outages. This transparency can only help them to become better in the future.