Tuesday's Heroku outage post-mortem

69 points by bscofield over 14 years ago

7 comments

erikpukinskis over 14 years ago
I think it's fascinating that a single engineer, who they had on staff, was able to write a patch in one night that improved the performance of their messaging system 5x...

... *and hadn't already done it*.

This isn't meant as a slight against Heroku at all. They've got an incredible team of engineers. But imagine if Ricardo had said "hey, I could write a patch today that would speed up our messaging system 5x, should I do it?" The rest of the team would've said "OF COURSE!"

It reminds me of what happens to your brain when you launch a site. Even before you get feedback, somehow *the knowledge that people can use it* drastically changes your motivation system. Things that before seemed important are obviously not. Other things that were invisible before become the singular focus of your resources.

Maybe we should do fire drills...

* Your requests are suddenly taking 100x as long to complete. Go!

* Your "runway" disappears due to an accounting error and you have 7 days to turn a profit. Go!

* 50% of people visiting the site have no idea how to use it. Go!

How could we achieve the focus and clarity that a crisis brings on, without having the crisis?
danilocampos over 14 years ago
I'd like to point out the lesson that other industries can learn from IT infrastructure companies.

Heroku sells a technical product to a technical audience. They're foundational to their clients' products. So when something goes down, there's only one option: explain, in excruciating detail, exactly what happened, why it happened, and how it's going to be fixed in the future.

Why? Because their clients can smell bullshit better than a purebred bloodhound. Too much bullshit means it's time to move on.

Beyond being the right thing to do, being accountable is essential to trust. When you fuck up, it will piss people off. That's just life – everyone makes mistakes. So you need to be the guy where people can say "Okay, there was a fuck up, it was bad, but look at how hard these guys worked to fix it. Check out their plans to prevent it in the future."

Luckily, the incentives are aligned here to make this mostly non-negotiable. When you get medical malpractice, a financial meltdown or an oil spill going on, the cover-your-ass impulses are much more compelling.

Even in those cases though, I insist we need to encourage a culture where accountability and transparency are rewarded. Because, for me, accountable guys are the kind of people I want to do business with.

I dunno much about scaling a Rails server, but for now, at least, I know the Heroku guys are the sort of people I'd trust.
absconditus over 14 years ago
Will someone at Heroku please describe your QA process?
smackfu over 14 years ago
Are they going to remove "rock-solid" from their front page copy?
GICodeWarrior over 14 years ago
Does Heroku use anything like 5 Whys to incrementally address organizational-type causes?
random42 over 14 years ago
"After isolating the bug, we attempted to roll back to a previous version of the routing mesh code. While the rollback solved the initial problem, there was an unexpected incompatibility between the routing mesh and our caching service."

To me, it seems like they just needed to apply the "hot patch"; instead they panicked(?) and did a lot of unnecessary version control gymnastics, which delayed the bug fix.
dlevine over 14 years ago
I think it's really cool that Heroku is so transparent about their outages. A lot of companies try to cover them up or blame them on someone else.

It's refreshing to see a company that not only acknowledges their outages, but even has a list of all past issues and outages. This transparency can only help them to become better in the future.