
Full technical details on Asana's worst outage

77 points by marcog1 over 8 years ago

10 comments

merb over 8 years ago
> Initially the on-call engineers didn’t understand the severity of the problem

Every outage report I read mentions something like that happening. At least Asana didn't blame the technology they were using.
katzgrau over 8 years ago
These sorts of deeply apologetic and hyper-transparent post-mortems have become commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but perhaps delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually end up leaving the customer more unsettled than they would have been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, and high-level explanation without getting into the gruesome details.

I'm not saying my approach is best, but I do think trying to avoid scaring people with your explanation is a good idea.
madelinecameron over 8 years ago
> And to make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS-EC2 instances than the production version

... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully), but it doesn't give you the same experience.
bArray over 8 years ago
Was this incident really recorded minute by minute, or is that made up? I've noticed that a lot of companies giving this kind of detail like to publish a minute-by-minute report; I just don't understand how they get that accuracy.
kctess5 over 8 years ago
I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.
mathattack over 8 years ago
Not a bad reaction. With all the reverts, is there a QA issue? Or too many releases?
zzzcpan over 8 years ago
Strangely, there are no actual technical details in the report, and the blame is placed on the process, although most of the time there is some way to prevent bugs from causing problems with better architecture.
cookiecaper over 8 years ago
Reading through this, it sounds like some basic monitoring would've quickly allowed them to pinpoint the cause instead of wasting time with database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug or issue in the recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated prior to pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.
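As a rough illustration of the deploy-marking idea above (not something described in Asana's post), a deploy script could post an event to Datadog's v1 events endpoint so each release shows up as a marker on the dashboards; the DD_API_KEY variable and version string here are assumptions for the sketch.

    import os
    import requests  # third-party HTTP client

    def mark_deploy(version: str) -> None:
        """Post a deployment event to Datadog so it appears alongside the metric charts."""
        resp = requests.post(
            "https://api.datadoghq.com/api/v1/events",
            headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
            json={
                "title": f"Deployed {version}",
                "text": "Web tier deploy finished",
                "tags": ["deploy", f"version:{version}"],
            },
            timeout=5,
        )
        resp.raise_for_status()

    # Called at the end of the deploy script, e.g.:
    # mark_deploy("build-1432")

With events like these in the stream, a CPU spike that starts right at a deploy marker is hard to miss.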
qaq over 8 years ago
This is "not that different" from getting a very high load spike. Do you guys not have some autoscaling setup?
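For context on the kind of setup qaq is alluding to (the post never says what Asana actually ran), a target-tracking scaling policy on an EC2 Auto Scaling group might look roughly like this with boto3; the group name "web-tier-asg" and the 50% CPU target are made-up values for illustration.

    import boto3  # AWS SDK for Python

    autoscaling = boto3.client("autoscaling")

    # Keep average CPU of the web tier near 50% by adding/removing instances.
    # Group name and target value are hypothetical.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-tier-asg",
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": 50.0,
        },
    )

Autoscaling wouldn't fix a bad deploy, but it buys time while the cause is pinpointed.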
jwatte over 8 years ago
The real support for a frequent deployment system is in the immune system! I've had good luck with a deployment immune system that rolls back if CPU or other load jumps, even if it doesn't immediately cause user failure. (I.e., monitor crucial internals, not just user availability.)
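A minimal sketch of the kind of immune-system check jwatte describes: watch a key internal metric for a few minutes after a deploy and roll back automatically if it jumps. The get_avg_cpu() and rollback() hooks, the thresholds, and the window are all hypothetical stand-ins for whatever monitoring and deploy tooling is actually in place.

    import time

    CPU_JUMP_THRESHOLD = 1.5   # roll back if CPU rises 50% above the pre-deploy baseline
    WATCH_SECONDS = 300        # observation window after the deploy
    POLL_SECONDS = 15

    def get_avg_cpu() -> float:
        """Hypothetical hook: fetch average web-tier CPU from the monitoring system."""
        raise NotImplementedError

    def rollback() -> None:
        """Hypothetical hook: revert to the previous known-good release."""
        raise NotImplementedError

    def watch_deploy(baseline_cpu: float) -> bool:
        """Return True if the deploy is kept, False if it was rolled back."""
        deadline = time.time() + WATCH_SECONDS
        while time.time() < deadline:
            if get_avg_cpu() > baseline_cpu * CPU_JUMP_THRESHOLD:
                rollback()
                return False
            time.sleep(POLL_SECONDS)
        return True

The point of keying on internals rather than user-visible errors is that load regressions show up well before requests start failing outright.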