Full technical details on Asana's worst outage

77 points by marcog1 · over 8 years ago

10 comments

merb · over 8 years ago
> Initially the on-call engineers didn't understand the severity of the problem

In every outage report I read, something like that happened. At least Asana didn't blame the technology they were using.
katzgrau · over 8 years ago
These sorts of deeply apologetic and hyper-transparent post-mortems have become commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but delving into the fine details of the investigation (various hypotheses, overlooked warning signs, yada yada) might actually leave the customer more unsettled than they would have been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes down it's damn near catastrophic for my customers and their customers. When we do have a (rare) outage, I sweat it out, reassure customers that people are on it, and give a brief, accurate, high-level explanation without getting into the gruesome details.

I'm not saying my approach is best, but I do think there's something to be said for not scaring people with your explanation.
madelinecameron · over 8 years ago
> And to make things even more confusing, our engineers were all using the dogfooding version of Asana, which runs on different AWS-EC2 instances than the production version

... That kind of defeats the purpose of "dogfooding". Sure, you have to use the same code (hopefully), but it doesn't give you the same experience.
bArray · over 8 years ago
Was this incident really recorded minute by minute, or is that made up? I've noticed that a lot of companies that publish this kind of detail like to give a minute-by-minute report; I just don't understand how they get that accuracy.
kctess5 · over 8 years ago
I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.
mathattack · over 8 years ago
Not a bad reaction. With all the reverts, is there a QA issue? Or too many releases?
zzzcpan · over 8 years ago
Strangely, there are no actual technical details in the report, and the blame is put on the process, even though most of the time there is some way to prevent bugs from causing problems with better architecture.
cookiecaper · over 8 years ago
Reading through this, it sounds like some basic monitoring would've quickly allowed them to pinpoint the cause instead of wasting time with database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug or issue in the recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated *prior to* pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.
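To make the CPU-alarm idea concrete, here is a minimal sketch using boto3 and CloudWatch; the instance ID, threshold, and SNS topic are placeholders for illustration, not details from the post-mortem.

# Sketch of a per-web-server CPU alarm (assumes boto3 and an existing SNS
# topic used for paging; the instance ID and numbers below are made up).
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="web-01-cpu-redline",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=60,                # evaluate 1-minute averages
    EvaluationPeriods=5,      # 5 consecutive minutes over threshold
    Threshold=90.0,           # percent CPU
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
    AlarmDescription="Web tier redlining; check the most recent deploy first.",
)

Requiring the threshold to hold for several consecutive minutes avoids paging on short spikes while still catching a deploy that pins the boxes.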
qaq · over 8 years ago
This is "not that different" from getting a very high load spike. Do you guys not have some autoscaling setup?
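For reference, a minimal sketch of the kind of autoscaling setup being asked about, assuming an EC2 Auto Scaling group and boto3; the group name and target value are placeholders.

# Target-tracking scaling on average CPU for the web tier (group name and
# target value are illustrative assumptions, not Asana's actual setup).
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier-asg",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # add instances when average CPU exceeds ~60%
    },
)

Target tracking adds capacity under a genuine load spike, though it may not help much if a bad deploy drives every instance to 100% regardless of traffic.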
jwatte · over 8 years ago
The real support for a frequent deployment system is in the immune system! I've had good luck with a deployment immune system that rolls back if CPU or other load jumps, even if it doesn't immediately cause user failure. (I.e., monitor crucial internals, not just user availability.)
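A toy sketch of that kind of immune system, assuming a Python watchdog with psutil running on each host after a deploy; the rollback command is a hypothetical placeholder for whatever the deploy tooling actually provides.

# Watch host CPU for a window after a deploy and roll back if it jumps well
# above the pre-deploy baseline, even if users aren't failing yet.
# "./deploy.sh rollback previous" is a hypothetical stand-in command.
import subprocess
import time

import psutil

BASELINE_CPU = psutil.cpu_percent(interval=5)   # sampled before the deploy settles
WATCH_SECONDS = 600                             # keep watching for 10 minutes
JUMP_FACTOR = 2.0                               # what counts as "load jumps"

deadline = time.time() + WATCH_SECONDS
while time.time() < deadline:
    cpu = psutil.cpu_percent(interval=30)       # 30-second average
    if cpu > max(50.0, BASELINE_CPU * JUMP_FACTOR):
        subprocess.run(["./deploy.sh", "rollback", "previous"], check=True)
        break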