
Post-Mortem for Google Compute Engine’s Global Outage on April 11

799 points, posted by sgrytoyr about 9 years ago

40 comments

brianwawok, about 9 years ago
This is a very good post-mortem.

As I assumed, it was kind of a corner-case bug meeting a corner-case bug meeting a corner-case bug.

This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?

Making software is hard....
teraflop, about 9 years ago
> There are a number of lessons to be learned from this event -- for example, that the safeguard of a progressive rollout can be undone by a system designed to mask partial failures -- ...

This is a really important point that should be more generally known. To quote Google's own "Paxos Made Live" paper, from 2007:

> In closing we point out a challenge that we faced in testing our system for which we have no systematic solution. By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.

As developers we can try to bear this principle in mind, but as Monday's incident demonstrated, mistakes can still happen. So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
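One partial countermeasure, in the spirit of the quoted passage, is to make every masked failure observable rather than silent. A minimal Python sketch, with hypothetical names; the primary/secondary stores and the counter are stand-ins, not anything from the post-mortem:

import logging

degraded_requests = 0  # stand-in for a real monitoring counter

def read_with_fallback(primary, secondary, key):
    """Serve the request from a secondary copy, but surface the degradation."""
    global degraded_requests
    try:
        return primary[key]
    except KeyError:
        # The system keeps working, yet the loss of redundancy is recorded
        # instead of being silently masked.
        degraded_requests += 1
        logging.warning("primary miss for %r, served from secondary", key)
        return secondary[key]

if __name__ == "__main__":
    primary = {"a": 1}
    secondary = {"a": 1, "b": 2}
    print(read_with_fallback(primary, secondary, "b"))  # works, but is counted
    print("degraded_requests =", degraded_requests)

A fault-tolerant layer that exports this kind of "running degraded" signal gives monitoring something to alert on before the last layer of redundancy is gone.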
cosud, about 9 years ago
Great writeup! PS: "To make error is human. To propagate error to all server in automatic way is devops." -DevOps Borat
cjbprime, about 9 years ago
It looks like there were at least three catastrophic bugs present:

1. Evaluated a configuration change before the change had finished syncing across all configuration files, resulting in rejecting the change.

2. So it tried to reject the change, but actually just deleted everything instead.

3. Something was supposed to catch changes that break everything, and it detected that everything was broken, but its attempt to do anything to fix it failed.

It is hard to imagine that this system has good test coverage.
stcredzero, about 9 years ago
*In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step's conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.*

Classic Two Generals. "No news is good news" generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...
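The fix this comment points at is to require a positive, explicit verdict and to treat silence or a timeout as failure. A minimal sketch, assuming a hypothetical canary API; none of these names come from Google's system:

from enum import Enum

class CanaryVerdict(Enum):
    PASS = "pass"
    FAIL = "fail"

def should_roll_out(verdict):
    """Advance the rollout only on an explicit PASS; anything else halts it."""
    return verdict is CanaryVerdict.PASS

if __name__ == "__main__":
    print(should_roll_out(CanaryVerdict.PASS))  # True
    print(should_roll_out(CanaryVerdict.FAIL))  # False
    print(should_roll_out(None))                # False: silence is not consent

The point is the default: when the canary's conclusion fails to arrive (as it did here, thanks to the second bug), the absence of a verdict blocks the push instead of letting it proceed.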
Gravityloss, about 9 years ago
I'm waiting for the time when they push over-the-air updates to airplanes in flight.

"You can fly safely, we have canaries and staged deployment."

A year forward:

"Unfortunately, because the canary verification as well as the staged deployment code was broken, instead of one crash and 300 dead, an update was pushed to all aircraft, which subsequently caused them to crash, killing 70,000 people."

I'm not 100% sure why they don't do the staged deployment for Google-scale server networking over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...

It's good that they had a manually triggerable configuration rollback and a pre-set policy, so it was resolved this quickly.
ndesaulniers, about 9 years ago
At Google, they do these really awesome post-mortems when there's a major failure. They provide a point of reflection and are usually well-written, entertaining reads. I didn't know they made (some?) public.

Writing one is a good learning exercise, and it is treated more as a learning exercise than as a punishment.
dylanz, about 9 years ago
Completely off topic, but this thread is an example of why I (and a lot of people) want collapsible comments native to HN. I'm on my phone, in Safari, and I had to scroll for over 20 seconds just to reach the second comment. The first comment was a tangent about self-driving cars, which, while relevant, I didn't want to read about.
ikeboy, about 9 years ago
> However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

> Crucially however, a second software bug in the management software did not propagate the canary step's conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

I assume the software was originally tested to make sure it works in case of failure. It would be interesting to know exactly what the bug was and why it didn't show up in tests.
pjlegato, about 9 years ago
Attention startups: *this* is what incident post-mortems should look like.
eranation, about 9 years ago
This is very interesting. From the little I understand (sorry for using AWS terms, as I am more versed in AWS than GCE), this can happen to AWS as well, right? Even if your software is deployed to multiple AZs / multiple regions, if a bad routing / network configuration makes it through the various protection mechanisms, then basically no amount of redundancy can help if your service is part of the non-functional IP block. It seems that no matter how redundant you are, there will always be a single point of failure somewhere along the line; even if it has multiple mechanisms to prevent it from failing, if all of those mechanisms fail, it is still a single point. What prevents this from happening at Azure / AWS? Is there anything that general internet routing protocols need to change to prevent it from happening?

e.g. I'm sure we will never hear that Bank of X transferred a billion dollars to an account but, because of propagation errors, published only the credit and never finished the debit, and now we have two billionaires. This two-or-more-phase commit is pretty much bulletproof in banking as far as I know, and banks are not known to be technologically more advanced than Google, so how come internet routing is so prone to errors that can make an entire cloud service unavailable for even a small period of time? I'm far from knowing much about networking (although I took some graduate networking courses, I still feel I know practically nothing about it...), so I would appreciate it if someone versed in this could ELI5 whether it can happen in AWS and Azure regardless of how redundant you are (which leads to the notion of cross-cloud-provider redundancy, which I'm sure is used in some places), whether the banking analogy is fair and relevant, and whether there are any RFCs to make world-blackout routing nightmares less likely to happen.
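For what the banking analogy maps to in configuration terms, here is a minimal two-phase sketch: no region applies anything until every region has validated and staged the change. This is a generic illustration of two-phase commit with hypothetical names, not a description of how GCE's push system actually works:

class FakeRegion:
    """Stand-in for a real regional config endpoint, only to make the sketch runnable."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
    def prepare(self, config):
        # Validate and stage the change without applying it.
        return self.healthy
    def abort(self):
        print(f"{self.name}: staged change discarded")
    def commit(self):
        print(f"{self.name}: config applied")

def two_phase_push(regions, new_config):
    # Phase 1: prepare. Any failure aborts the rollout everywhere.
    prepared = []
    for region in regions:
        if not region.prepare(new_config):
            for r in prepared:
                r.abort()
            return False
        prepared.append(region)
    # Phase 2: commit. Reached only if every region prepared successfully.
    for region in prepared:
        region.commit()
    return True

if __name__ == "__main__":
    regions = [FakeRegion("us-east1"), FakeRegion("us-west1", healthy=False)]
    print("rollout committed:", two_phase_push(regions, {"routes": []}))

Banks accept the latency cost of this kind of coordination; routing systems usually favor availability and fast convergence, which is part of why the failure modes look so different.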
wyldfire, about 9 years ago
> ... Internal monitors generated dozens of alerts in the seconds after the traffic loss became visible at 19:08 ... revert the most recent configuration changes ... the time from detection to decision to revert to the end of the outage was thus just 18 minutes.

It's certainly good that they detected it as fast as they did. But I wonder if the fix time could be improved upon? Was the majority of that time spent discussing the corrective action to be taken? Or does it take that much time to replicate the fix?
obulpathi, about 9 years ago
> Finally, to underscore how seriously we are taking this event, we are offering GCE and VPN service credits to all impacted GCP applications equal to (respectively) 10% and 25% of their monthly charges for GCE and VPN.

These credits exceed what is promised by Google Cloud in their SLAs for Compute Engine and the VPN service!
balls187, about 9 years ago
Nice post-mortem.

That outage leaves GCE with, at best, four 9's of reliability for 2016.
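Rough arithmetic behind the "four 9's at best" claim, taking the 18-minute connectivity loss quoted elsewhere in this thread and assuming no further downtime for the rest of 2016 (a leap year):

minutes_in_2016 = 366 * 24 * 60        # 527,040 minutes
outage_minutes = 18

best_availability = 1 - outage_minutes / minutes_in_2016
print(f"best possible 2016 availability: {best_availability:.4%}")     # ~99.9966%
print(f"four-nines budget:  {0.0001 * minutes_in_2016:.1f} minutes")   # ~52.7
print(f"five-nines budget:  {0.00001 * minutes_in_2016:.1f} minutes")  # ~5.3

Eighteen minutes already exceeds the roughly five-minute five-nines budget, so four 9's is indeed the ceiling for the year, though with plenty of room left under the ~53-minute four-nines budget.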
huula, about 9 years ago
I always like Google's serious attitude towards engineering; even after they have made some mistakes, they never try to hide anything.
totally, about 9 years ago
> However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration

> Crucially however, a second software bug in the management software did not propagate the canary step's conclusion back to the push process

I'm sure the devil is in the details, but generally speaking, these are two instances of critical code that gets exercised infrequently, which is a good place for bugs to hide.
pbreit, about 9 years ago
Do SLAs even matter in the slightest? Or are they just sort of "feel-good" things, or ways for negotiators to demonstrate their worth?
heisenbit, about 9 years ago
"Lessons learned from reading postmortems" (http://danluu.com/postmortem-lessons/) is a good place to dig deeper.

The first graph, quoted there from a survey paper, is a classic that fits the GCE outage well:

Initial error --92%--> Incorrect handling of errors explicitly signaled in software
simonebrunozzi, about 9 years ago
I love his signature: "Benjamin Treynor Sloss | VP 24x7".
rdtsc, about 9 years ago
> However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.

Always test your crash / exception-handling / special-case termination-and-recovery code in production.

I have seen this too often, most often in "every day" cases where a service has a "nice" catch-based way of stopping and recovering, plus a separate "killed by SIGKILL / immediate power failure" crash-and-recovery path. That last bit never gets tested or run in production.

One day a power failure happens, and the service restarts and tries to recover. Code that almost never runs now runs, and the whole thing goes into an unknown broken state.
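One way to act on that advice is to exercise the ungraceful path on purpose, routinely. A minimal sketch of such a drill, with a hypothetical worker and toy recovery logic, assuming POSIX-style kill semantics:

import os, subprocess, sys, tempfile, textwrap, time

WORKER = textwrap.dedent("""
    import sys, time
    path = sys.argv[1]
    with open(path, "a") as f:
        while True:
            f.write("record\\n")
            f.flush()
            time.sleep(0.01)
""")

def recover(path):
    """Toy recovery: keep only complete lines, dropping any torn final record."""
    with open(path) as f:
        return sum(1 for line in f if line.endswith("\n"))

if __name__ == "__main__":
    data = tempfile.NamedTemporaryFile(delete=False)
    worker = subprocess.Popen([sys.executable, "-c", WORKER, data.name])
    time.sleep(0.2)
    worker.kill()   # the abrupt path that "never runs" in production
    worker.wait()
    print("records recovered:", recover(data.name))
    os.unlink(data.name)

Run as a scheduled job against a staging copy, a drill like this keeps the SIGKILL-style recovery path from being the code that only ever executes during a real incident.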
halayli, about 9 years ago
This isn't the first time a config system at Google has caused a major outage:

https://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html
DanielDent, about 9 years ago
My post yesterday seems even more relevant today: https://news.ycombinator.com/item?id=11477552

It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.

Something like half of outages are caused by configuration oopsies.

If you accept that configuration *is* code, then you also come to the following disturbing conclusion: the usual test environment for critical network-related code in most environments is the production environment.
zaroth, about 9 years ago
For the amount this cost them, they should have bought CloudFlare. If you play with [global BGP anycast] you are bound to get burned. This is not the first time that BGP took out your entire routing. This is probably not the last time that BGP will take out your entire routing. Whoever's job it was to watch the routing, I am sorry.

Pulling your own worldwide routes because you have too much automation; it will make a good story once it's filtered down a bit! Icarus was barely up in the air, too early for a fall.
swills, about 9 years ago
The thing that stood out for me was:

"...team...worked in shifts overnight..."
grogers, about 9 years ago
How important for redundancy/quality of service is the feature of advertising each region's IP blocks from multiple points in Google's network? It seems like region isolation is the most important quality that Google's network could provide, and their current design is what made something like this possible, not just the bugs in the configuration propagation. They mention the ability of the internet to route around failures, so why not rely on that instead?
trhway, about 9 years ago
As DevOps Borat was saying all along, automated propagation of an error is the main root cause here. An error (a new configuration) should be rolled out site by site: ok us-east1, move on to us-west1... ok, move on to... . A canary site may be the first in the sequence, yet success ("no failure reported") can't be a big "ok" for an automated push to all sites at the same time.
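Sketched out, a rollout along those lines advances one region at a time and only on an explicit healthy verdict, never on the absence of a failure report. The region names and the push/verify hooks below are placeholders, not real APIs:

import time

def staged_rollout(regions, config, push, verify, soak_seconds=0):
    for region in regions:
        push(region, config)
        time.sleep(soak_seconds)            # let the change soak before judging
        if verify(region) is not True:      # only an explicit True advances
            print(f"halting rollout: {region} did not report healthy")
            return False
    return True

if __name__ == "__main__":
    health = {"us-east1": True, "us-west1": False, "europe-west1": True}
    staged_rollout(
        ["us-east1", "us-west1", "europe-west1"],
        {"routes": ["10.0.0.0/8"]},
        push=lambda region, cfg: print(f"pushed to {region}"),
        verify=lambda region: health.get(region),
    )

With real soak times of hours or days per region, a bad configuration stops at the first region that degrades instead of reaching every site at once.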
mjevans, about 9 years ago
I hope that one of their solutions is the obvious one: make change-control testing a closed loop instead of an open loop. (Watch for /success/ being reported instead of watching for failure notifications.)
platz, about 9 years ago
> configuration file

Configuration files strike again. Remember Knight Capital?
nickysielicki, about 9 years ago
What does Google use for BGP? Quagga, OpenBGPD, BIRD, their own?

Also, does anyone have a link to statistics on global BGP software usage? I'm curious what the market share looks like.
Tistel, about 9 years ago
The postmortem used the word "quirk." They might consider drilling down on the specifics there, especially if that is the heart of the bug/accident.
JustUhThought, about 9 years ago
Just a thought: maybe change the name from 'post-mortem' to anything else, before the event actually is a post-mortem.
sengork, about 9 years ago
Networking issues in either the storage or communication subsystems of any platform normally result in widespread disruptions.
itaifrenkel, about 9 years ago
What is the reason different GCE regions use the same IP blocks?
hvass, about 9 years ago
What is defense in depth? It is mentioned as a core principle.
awinter-py, about 9 years ago
Chaos Monkey?
hsod, about 9 years ago
> Crucially however, a second software bug in the management software did not propagate the canary step's conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.

Perhaps the progressive rollout should wait for an affirmative conclusion instead of assuming no news is good news? I'm not being snarky; there may be some reason they don't do this.
contingencies, about 9 years ago
TL;DR: they simply didn't test their (global!) custom route-announcement management software. An edge case was triggered in production, and they gee-whiz-automatically went offline. Epic fail.

PS. To the downvoters: truth hurts.
herrvogel-, about 9 years ago
A bit off topic, but it really bugs me that the banner at the top is so pixelated.
qaq, about 9 years ago
DRY. "The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management."
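The DRY angle suggests one cheap guard: if two configuration sources are supposed to agree on the advertised IP blocks, verify that invariant before pushing, and treat "not yet propagated" as a reason to wait rather than as a removal. A minimal sketch with made-up data, not Google's actual file layout:

def safe_to_push(primary_blocks, secondary_blocks):
    """Refuse to push while the two configuration sources disagree."""
    primary, secondary = set(primary_blocks), set(secondary_blocks)
    if primary != secondary:
        print("sources disagree, refusing to push:",
              sorted(primary.symmetric_difference(secondary)))
        return False
    return True

if __name__ == "__main__":
    # The second source has not yet seen the removal of one block.
    print(safe_to_push(["10.0.0.0/8"], ["10.0.0.0/8", "192.0.2.0/24"]))  # False
    print(safe_to_push(["10.0.0.0/8"], ["10.0.0.0/8"]))                  # True

The stronger fix, of course, is a single source of truth from which the second file is derived, so the two can never disagree in the first place.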
NetStrikeForce, about 9 years ago
I think most people are missing the main failure point: why does one change propagate automatically to all regions?

All of this could have been contained if they had deployed changes to different regions at different times. That would also help them screw their overseas users less, by running maintenance at 10am their local time :-)