
Full technical details on Asana's worst outage

77 points by marcog1 over 8 years ago

10 comments

merb over 8 years ago
> Initially the on-call engineers didn’t understand the severity of the problem

Every outage report I read mentions something like that happening. At least Asana didn't blame the technology they were using.
katzgrau over 8 years ago
These sorts of deeply apologetic and hyper-transparent post-mortems have become commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high-level explanation without getting into the gruesome details.

I'm not saying my approach is best, but I do think trying to avoid scaring people with your explanation is a good idea.
madelinecameron over 8 years ago
> And to make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS-EC2 instances than the production version

... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully), but it doesn't give you the same experience.
bArray over 8 years ago
Was this incident really recorded minute by minute, or is that made up? I've noticed that a lot of companies giving this kind of detail like to publish a minute-by-minute report; I just don't understand how they get that accuracy.
kctess5 over 8 years ago
I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.
mathattack over 8 years ago
Not a bad reaction. With all the reverts, is there a QA issue? Or too many releases?
zzzcpan over 8 years ago
Strangely, there are no actual technical details in the report, and the blame is placed on the process, although most of the time there is some way to prevent bugs from causing problems with better architecture.
cookiecaper over 8 years ago
Reading through this, it sounds like some basic monitoring would've quickly allowed them to pinpoint the cause instead of wasting time with database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug or issue in the recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.
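As a rough illustration of the deploy-marking idea above (not something described in Asana's post), a deploy script could post an event to Datadog's v1 events endpoint so each release shows up as a marker on the dashboards; the DD_API_KEY variable and version string here are assumptions for the sketch.

    import os
    import requests  # third-party HTTP client

    def mark_deploy(version: str) -> None:
        """Post a deployment event to Datadog so it appears alongside the metric charts."""
        resp = requests.post(
            "https://api.datadoghq.com/api/v1/events",
            headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
            json={
                "title": f"Deployed {version}",
                "text": "Web tier deploy finished",
                "tags": ["deploy", f"version:{version}"],
            },
            timeout=5,
        )
        resp.raise_for_status()

    # Called at the end of the deploy script, e.g.:
    # mark_deploy("build-1432")

With events like these in the stream, a CPU spike that starts right at a deploy marker is hard to miss.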
qaq over 8 years ago
This is "not that different" from getting a very high load spike. Do you guys not have some autoscaling setup?
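For context on the kind of setup qaq is alluding to (the post never says what Asana actually ran), a target-tracking scaling policy on an EC2 Auto Scaling group might look roughly like this with boto3; the group name "web-tier-asg" and the 50% CPU target are made-up values for illustration.

    import boto3  # AWS SDK for Python

    autoscaling = boto3.client("autoscaling")

    # Keep average CPU of the web tier near 50% by adding/removing instances.
    # Group name and target value are hypothetical.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-tier-asg",
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,
        },
    )

Autoscaling wouldn't fix a bad deploy, but it buys time while the cause is pinpointed.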
jwatte over 8 years ago
The real support for a frequent deployment system is in the immune system! I've had good luck with a deployment immune system that rolls back if CPU or other load jumps, even if it doesn't immediately cause user failure. (I.e., monitor crucial internals, not just user availability.)
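A minimal sketch of the kind of immune-system check jwatte describes: watch a key internal metric for a few minutes after a deploy and roll back automatically if it jumps. The get_avg_cpu() and rollback() hooks, the thresholds, and the window are all hypothetical stand-ins for whatever monitoring and deploy tooling is actually in place.

    import time

    CPU_JUMP_THRESHOLD = 1.5   # roll back if CPU rises 50% above the pre-deploy baseline
    WATCH_SECONDS = 300        # observation window after the deploy
    POLL_SECONDS = 15

    def get_avg_cpu() -> float:
        """Hypothetical hook: fetch average web-tier CPU from the monitoring system."""
        raise NotImplementedError

    def rollback() -> None:
        """Hypothetical hook: revert to the previous known-good release."""
        raise NotImplementedError

    def watch_deploy(baseline_cpu: float) -> bool:
        """Return True if the deploy is kept, False if it was rolled back."""
        deadline = time.time() + WATCH_SECONDS
        while time.time() < deadline:
            if get_avg_cpu() > baseline_cpu * CPU_JUMP_THRESHOLD:
                rollback()
                return False
            time.sleep(POLL_SECONDS)
        return True

The point of keying on internals rather than user-visible errors is that load regressions show up well before requests start failing outright.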