Final Root Cause Analysis of Nov 18 Azure Service Interruption

198 points by asyncwords over 10 years ago

19 comments

ChuckMcM over 10 years ago
Nice writeup. I hope that the engineer in question didn't get fired or anything. One of the challenges in SRE/Ops type organizations is to be responsible, take ownership, put in the extra time to fix things you break, *but keep the nerve to push out changes.* Once an ops team loses its willingness to push large changes, the infrastructure calcifies and you have a much bigger problem on your hands.
tytso over 10 years ago
The really big missing piece that I found in this post mortem is this: if it only took 30 minutes to revert the original change, why did it take over ten hours to restart the Azure Blob storage servers? This was neatly elided in the last sentence of this paragraph of their writeup:

".... We reverted the change globally within 30 minutes of the start of the issue which protected many Azure Blob storage Front-Ends from experiencing the issue. The Azure Blob storage Front-Ends which already entered the infinite loop were unable to accept any configuration changes due to the infinite loop. These required a restart after reverting the configuration change, extending the time to recover."

That ten-plus-hour extension was the vast majority of the outage time; why wasn't the reason for this given? More importantly, what will be done to prevent a similar extension in the time Azure spends belly up if, at some point in the future, the Blob servers go insane and have to be restarted?
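To make the failure mode quoted here a little more concrete, below is a minimal, purely hypothetical sketch (the flag name and structure are invented; this is not Azure's actual code) of a front-end whose single loop both serves requests and applies configuration updates. Once a handler starts spinning, the revert queued behind it can never be applied, so a process restart is the only way out.

```python
import queue

pending_config = queue.Queue()       # config updates pushed by the deployment system
config = {"new_blob_cache": False}   # hypothetical flag name, not the real setting

def handle_request(req, cfg):
    if cfg["new_blob_cache"]:
        while True:                  # the bug: spins forever under the new setting
            pass
    return f"served {req}"

def event_loop(requests):
    for req in requests:
        # Config changes are only applied here, between requests.  A handler
        # stuck in the loop above never returns, so a revert queued later
        # (new_blob_cache=False) is never picked up; only a restart clears it.
        while not pending_config.empty():
            config.update(pending_config.get())
        handle_request(req, config)
```

The general point: if config application shares a code path with the logic the config breaks, a bad config can lock out its own fix.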
mrb over 10 years ago
*"These Virtual Machines were recreated by repeating the VM provisioning step. Linux Virtual Machines were not affected."*

So Azure supports Linux VMs?! Microsoft does so little Azure advertising that I had to learn this fact from their RCA. Apparently they have supported it since 2012: http://www.techrepublic.com/blog/linux-and-open-source/microsoft-now-offering-linux-on-azure-what-does-this-mean/ but it is likely that many non-users of Azure do not know this.
sandis over 10 years ago
> The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.

Ugh, I wouldn't want to be that guy (even if there would be no direct repercussions). That said, and as others have highlighted - kudos on the writeup and openness.
coldcode over 10 years ago
Shit happens, and at this scale it happens big. I wish everyone would provide details like this when the fan gets hit or your security fails. I'm glad I never have to deal with scale like this; it's pretty scary.
pfortuny over 10 years ago
Impressive non-jargonized report. I would have "quantified" the "small number" but kudos anyway to Microsoft for taking this path towards transparency.
ha292 over 10 years ago
This is a good effort. I do have some concerns about it.

A true root cause would go deeper and ask why it is that an engineer could single-handedly decide to roll out to all slices.

The surface-level answer is that the Azure platform lacked tooling. Is that the cause or an effect? I think it is an effect. There are deeper root causes.

Let's ask -- why was it that the design allowed one engineer to effectively bring down Azure?

We often stop at these RCAs when it gets uncomfortable and it starts to point upwards.

I say this to the engineer who pressed the buttons: Bravo! You did something that exposed a massive hole in Azure, which may very well have prevented a much bigger embarrassment.
nchelluri over 10 years ago
I'm pretty impressed with the openness of this statement.
jabanico over 10 years ago
"Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure." They relied on tooling to do the review of the last step of the process? I would have thought there were a few layers of approval that go along with that final push into mission-critical infrastructure.
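For a sense of what "adequate enforcement" in the tooling could look like, here is a small hypothetical sketch (stage names, fields, and checks are all invented; this is not Microsoft's deployment system) in which the tool itself rejects a rollout that skips flighting stages or lacks a second approver:

```python
from dataclasses import dataclass, field

STAGES = ["flight_slice", "one_region", "paired_regions", "global"]  # invented stage names

@dataclass
class Rollout:
    change_id: str
    target_stage: str
    completed_stages: list = field(default_factory=list)
    approvers: list = field(default_factory=list)

def validate(rollout: Rollout) -> None:
    idx = STAGES.index(rollout.target_stage)
    # Every earlier stage must already have completed (and baked) before
    # the tool will even consider the next one.
    missing = [s for s in STAGES[:idx] if s not in rollout.completed_stages]
    if missing:
        raise PermissionError(f"{rollout.change_id}: stages not yet completed: {missing}")
    # A global push is never a one-person decision.
    if rollout.target_stage == "global" and len(set(rollout.approvers)) < 2:
        raise PermissionError(f"{rollout.change_id}: global rollout needs a second approver")

# The Nov 18 scenario: jumping from a single flight slice straight to global is rejected.
try:
    validate(Rollout("cfg-blob-cache", "global", ["flight_slice"], ["engineer_a"]))
except PermissionError as err:
    print(err)   # -> stages not yet completed: ['one_region', 'paired_regions']
```

The point of putting the check in the tool rather than in a runbook is exactly the gap the RCA names: a policy that relies on human decisions and protocol cannot be the only thing standing between one engineer and a global push.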
mooneater over 10 years ago
Pros:

- they are sharing info
- they allowed some caustic comments to remain at the bottom of the page (so far).

Cons:

- This is almost 30 days after the incident
- Look at the regions, it was global!
- This was a whole chain of issues. I count it as 5 separate issues. This goes deep into how they operate and it does not paint a picture of operational maturity:

1: configuration change for the Blob Front-Ends exposed a bug in the Blob Front-Ends

2: Blob Front-Ends infinite loop delayed the fix (I count this as a separate issue though I expect some may not)

3: As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service

4: Update was made across most regions in a short period of time due to operational error

5: Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard

That is quite a list. [Edit: formatting only]
Redsquare over 10 years ago
Why are they so quiet about SLA credit? Not a word for a month, and for a year I have been wasting good money doubling up services to stay inside the SLA and deploying cross-region to ensure zero downtime. What a joke. Surely Azure are not hoping we will forget?
sybhn over 10 years ago
TL;DR

> In summary, Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol.
spudlyo over 10 years ago
*After analysis was complete, we released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.*

Hopefully there is a way to disable this policy adherence for when you really need to push out a configuration or code change everywhere quickly.
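One way to reconcile that worry with strict enforcement is a "break-glass" path. The following is only a hypothetical sketch (the function, ticket format, and checks are invented): an emergency push skips the staged bake times but still demands an active incident, two operators, and an audit record.

```python
from datetime import datetime, timezone

def emergency_global_push(change_id: str, incident_id: str,
                          operators: list, audit_log: list) -> bool:
    # Skips the staged bake times, but refuses to be a silent single-person
    # action: it needs an incident, two distinct operators, and the override
    # itself is recorded for post-incident review.
    if not incident_id:
        raise PermissionError("emergency push requires an active incident id")
    if len(set(operators)) < 2:
        raise PermissionError("emergency push requires two distinct operators")
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "change": change_id,
        "incident": incident_id,
        "operators": operators,
        "mode": "break-glass",
    })
    return True  # caller may now fan the change out everywhere at once

log = []
emergency_global_push("cfg-blob-cache-revert", "INC-20141118",
                      ["engineer_a", "oncall_b"], log)
```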
markveronda over 10 years ago
I cannot believe how many times I have seen a PROD (or new env X) deployment go bad from configuration issues. At least they separate configuration deployments from code deployments; that's a good sign. Why not take it a step *further* and, instead of doing config deployments, use a config server?
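A rough sketch of the config-server idea, with an invented endpoint and key (this is not a real Azure API): front-ends poll a central config service instead of receiving configuration as a deployment, validate what they get, and fall back to the last known-good copy if the server is unreachable.

```python
import json
import urllib.request

CONFIG_URL = "https://config.example.internal/frontend.json"  # hypothetical endpoint
_last_good = {"new_blob_cache": False}                         # baked-in safe default

def fetch_config() -> dict:
    """Poll the central config service; fall back to the last known-good copy."""
    global _last_good
    try:
        with urllib.request.urlopen(CONFIG_URL, timeout=2) as resp:
            cfg = json.load(resp)
        if not isinstance(cfg.get("new_blob_cache"), bool):
            raise ValueError("malformed config rejected")
        _last_good = cfg
    except Exception:
        cfg = dict(_last_good)   # keep serving with the previous good config
    return cfg
```

Note that a config server centralizes rollback but does not by itself prevent a bad value from reaching every instance at once; you still want the value itself rolled out in stages, which is exactly the flighting discipline this RCA is about.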
teyc over 10 years ago
If utility computing is to be taken seriously, then it has to institute the same kind of discipline that we see in the airline industry. Recent examples come to mind: a pilot letting a songstress hold the controls and wear the pilot's cap - fired. An airline executive overruling a pilot over macadamia nuts - a million-dollar fine.

If we wish for a future where cloud computing will be considered reliable enough for air traffic control systems, then management of this infrastructure requires a level of dedication and commitment to process and training.

Failover zones need to be isolated not only physically, but also from command and control. A lone engineer should not have sufficient authority or capability to operationally control more than one zone. It is extremely unnerving for enterprises to see that a significant infrastructure like Azure has a root account which can take down the whole of Azure.
forgotAgain over 10 years ago
I hope the engineer in question did not get fired.

I also hope that no one who recommended Azure to their employer got fired either.
billarmstrong over 10 years ago
Only one question: Will the engineer be fired?
larrystrange over 10 years ago
OSS is alive and well on the Azure Platform: www.microsoft.com/openness
runT1ME over 10 years ago
Does anyone else see the missing piece to this post mortem? An infinite loop made its way onto a majority (all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers; failure should have only happened to a small subset'?

I agree that the improvements made to their deployment tooling were good and necessary; they take the human temptation to skip steps out of the equation.

But this exemplifies a *major* problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact.

I find this absolutely unacceptable. How about we short-circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here? Yes, I'm familiar with the halting problem and the limitations of formal verification on Turing-complete languages, but I don't believe that's an excuse.

This is tantamount to saying "yeah, sometimes our airplanes crash, so from now on we'll just make sure we have fewer passengers ride in the newer models".