
More details about the October 4 outage

473 points, by moneil971, over 3 years ago

39 comments

mumblemumble over 3 years ago
> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn't properly stop the command.

I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.
dmoy over 3 years ago
Interesting bit on recovery w.r.t. the electrical grid:

> flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems ...

I wish there were a bit more detail in here. What's the worst case there? Brownouts, exploding transformers? Or something less catastrophic?
chomp over 3 years ago
So someone ran "clear mpls lsp" instead of "show mpls lsp"?
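If a read-only check really was one verb away from a network-wide teardown, the audit layer the post mentions would be the last line of defense. A minimal sketch of what such a guard might look like; the patterns and function are hypothetical, since the post does not describe the real tool or the bug in it:

```python
import re

# Hypothetical deny rules: commands that mutate network state.
# Illustrative only -- not Facebook's actual audit tool.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"^\s*clear\s+mpls\s+lsp\b"),  # tears down MPLS label-switched paths
    re.compile(r"^\s*clear\s+bgp\b"),         # resets BGP sessions
]

# Prefixes for read-only queries that are always safe to forward.
READONLY_PREFIXES = ("show ", "display ")

def audit_command(cmd: str) -> bool:
    """Return True if the command is safe to forward to routers."""
    normalized = cmd.strip().lower()
    if normalized.startswith(READONLY_PREFIXES):
        return True  # read-only queries pass
    for pattern in DESTRUCTIVE_PATTERNS:
        if pattern.match(normalized):
            return False  # block state-mutating commands pending review
    return True
```

The post-mortem's point is that exactly this kind of gate existed but failed to fire, so the distinction between the "show" and "clear" forms never got a chance to matter.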
ghostoftiber over 3 years ago
Incidentally, the Facebook app itself handled this really gracefully. When the app can't connect to Facebook, it displays "updates" from a pool of cached content. It looks and feels like Facebook is there, but we know it's not. I didn't notice this until the outage, and I thought it was neat.
cnst over 3 years ago
Note that contrary to popular reports, DNS was NOT to blame for this outage — for once, DNS worked exactly per the spec, design, and configuration:

> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
Hokusai over 3 years ago
> One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses.

What is the target audience of this post? It is too technical for non-technical people, but it is also dumbed down to try to include people who do not know how the internet works. I feel like I'm missing something.
Animats over 3 years ago
You can see the security/reliability tradeoff problem here.

You need a control plane. But what does it run over? Your regular data links? A problem if it also controls their configuration. Something outside your own infrastructure, like a modest connection to the local ISP as a backup? That's an attack vector.

One popular solution is to keep both the current and previous generation of the control plane up. Both Google and AT&T seem to have done that. AT&T kept Signalling System 5 up for years after SS7 was doing all the work. Having two totally different technologies with somewhat different paths is helpful.
pm2222 over 3 years ago
I'm confused by this. Are the DNS servers inside the backbone, or outside?

> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
tigerlily over 3 years ago
We want ramenporn
vitus over 3 years ago
Google had a comparable outage several years ago.

https://status.cloud.google.com/incident/cloud-networking/19009

This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).

I'd be curious to see what systemic changes happen at FB as a result, if any.
nabakin over 3 years ago
Yesterday's post from Facebook:

https://engineering.fb.com/2021/10/04/networking-traffic/outage/
fakeythrow8way over 3 years ago
For a lot of people in countries outside the US, Facebook _is_ the internet. Facebook has cut deals with various ISPs outside the US to allow people to use their services without it costing any data. Facebook going down is a mild annoyance for us but a huge detriment to, say, Latin America.
i_like_apis over 3 years ago
> the total loss of DNS broke many of the internal tools we'd normally use to investigate and resolve outages like this. this took time, because these facilities are designed with high levels of physical and system security in mind. They're hard to get into, and once you're inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.

Sounds like it was the perfect storm.
Ansil849 over 3 years ago
> The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.

This makes it sound like Facebook has physically laid "tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers". Is this in fact true?
1vuio0pswjnm7 over 3 years ago
> Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP).

> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.

Correct me if I am wrong, but here "DNS servers" means the computers, not the software running on them, i.e., each computer is running both DNS software and a BGP daemon. I am not aware of DNS server software that disables BGP advertisements, but a BGP daemon could do it.

For example, a BGP daemon like ExaBGP can execute a DNS query, check the output, and disable advertisements if the query fails.

https://github.com/Exa-Networks/exabgp
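The ExaBGP pattern the comment describes, a side process that probes data-center reachability and feeds announce/withdraw commands to the BGP speaker over its stdout, can be sketched roughly as follows. The prefix, hostname, and probe are illustrative assumptions, not Facebook's actual setup:

```python
import socket
from typing import Optional

# Hypothetical anycast prefix for the DNS service this edge node advertises.
ANNOUNCE = "announce route 203.0.113.53/32 next-hop self"
WITHDRAW = "withdraw route 203.0.113.53/32 next-hop self"

def probe_backbone(host: str = "dc-resolver.internal.example",
                   port: int = 53, timeout: float = 2.0) -> bool:
    """Crude health probe: can this edge node reach a data-center resolver?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def transition(healthy: bool, was_healthy: bool) -> Optional[str]:
    """Return an ExaBGP text-API command only when the health state flips."""
    if healthy and not was_healthy:
        return ANNOUNCE   # backbone reachable again: re-advertise the prefix
    if not healthy and was_healthy:
        return WITHDRAW   # backbone unreachable: pull the prefix
    return None           # no state change, emit nothing
```

ExaBGP runs such a script from a `process` section of its configuration and applies each printed line. In the outage's terms: when the backbone vanished, every edge node's probe failed at once, so every node emitted the withdraw, taking the name servers off the internet even though they were still up.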
nonbirithm over 3 years ago
DNS seems to be a massive point of failure everywhere, even taking out the very tools meant to help deal with outages. The same thing has happened to Azure multiple times in the past, causing complete service outages. Surely there must be some way to better mitigate DNS misconfiguration by now, given the exceptional importance of DNS?
codebolt over 3 years ago
Apparently they had to bring in an angle grinder to get access to the server room.

https://twitter.com/cullend/status/1445156376934862848?t=P5ua0Fk7iPT5g-05bgaA8w&s=19
cube00 over 3 years ago
> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.

If you correctly design your security with appropriate fallbacks, you don't need to make this tradeoff.

If the story of the Facebook campus having no physical keyholes on its doors is true, it speaks to the arrogance of assuming things can never fail, so there's no need to even plan for it.
tantalor over 3 years ago
So it wasn't a config change, it was a command-of-death.
perryizgr8 over 3 years ago
> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.

Interesting that they think fluctuations of tens of megawatts would put electrical systems at risk. If the equipment was handling that much continuous load, wouldn't it also easily handle the resumption of the same load? Also, I totally did not understand how power usage would affect caches.
harshreality over 3 years ago
> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.

No, it's (clearly) not a guaranteed indication of that. Logic fail. Infrastructure tools at that scale need to handle all possible causes of test failures. "Is the internet down, or only the few sites I'm testing?" is a classic network-monitoring-script issue.
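The classic fix the comment alludes to is to consult independent references before concluding that the backbone, rather than the probing host itself, is unhealthy. A minimal decision sketch, with the probe targets left abstract (the reference probes would be against unrelated targets, e.g. other networks' anycast resolvers); this is an illustration of the monitoring pattern, not how Facebook's system works:

```python
from typing import Sequence

def classify_failure(backbone_ok: bool, reference_results: Sequence[bool]) -> str:
    """
    Distinguish 'our backbone is unhealthy' from 'this probe host is isolated'.
    reference_results: outcomes of probes against independent external targets.
    """
    if backbone_ok:
        return "healthy"
    # Backbone probe failed. If every reference probe also fails, the problem
    # is probably local to this vantage point -- withdrawing routes based on
    # that evidence would be the wrong call.
    if not any(reference_results):
        return "probe-isolated"     # keep advertising; raise an alarm instead
    return "backbone-unhealthy"     # references reachable, backbone is not:
                                    # withdrawing BGP advertisements is justified
```

The irony in this outage is that the second branch was the true one: the backbone really was gone, so the withdrawal was "correct" per the design, yet it still destroyed the ability to recover remotely.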
kaustubhvp over 3 years ago
Why were all BGP routes advertised from the same set of servers in the same DC, which ALL fb-owned domains pointed at?
byron22 over 3 years ago
What kind of BGP command would do that?
perryizgr8 over 3 years ago
When Facebook is directly peering with so many other ASes, why would they not have static routes in place for those direct links? Why run BGP for that? It's not as if there is going to be a better route than the direct link. If the link goes down, then you can rely on BGP to reroute.
jimmyvalmer over 3 years ago
tl;dr: a maintenance query was issued that inexplicably severed FB's data centers from the internet, which unnecessarily caused their DNS servers to mark themselves defunct, which made it all but impossible for their guys to repair the problem from HQ, which compelled them to physically dispatch field units whose progress was stymied by recently increased physical security measures.
HenryKissinger over 3 years ago
> During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally

Imagine being this person.

Tomorrow on /r/tifu.
jasonjei over 3 years ago
Will somebody lose a job over this?
henrypan1 over 3 years ago
Wow
Jamie9912 over 3 years ago
What would they have done if the whole data center was destroyed?
bilater over 3 years ago
I want to know what happened to the poor engineer who issued the command.
louwrentius over 3 years ago
> Our primary and out-of-band network access was down

Don't create circular dependencies.
TedShiller over 3 years ago
During the outage, FB briefly made the world a better place
jbschirtzs over 3 years ago
"The Devil's Backbone"
halotrope over 3 years ago
It is completely logical, but still kind of amazing, that Facebook plugged their globally distributed data centers together with physical wire.
ThinkBeat over 3 years ago
I am curious about something. It has been quite a while since I had any job that required me to think about DCs.

Back in the day we would have a setup of regular modems. If all hell broke loose, we could dial up a modem and have access to the system. It was not fast, and it was a pain, but we could get access that way. (I am skipping a lot of steps; there was heavy security involved in getting access.)

I guess landlines might not be an option anymore?
xphos over 3 years ago
I don't know, the -f in rm -rf isn't a bug xD. I feel sorry for the poor engineer who fat-fingered the command. It definitely highlights an anti-pattern in the command line, but the fact that that singular console had the power to affect the entire network highlights an "interesting" design choice indeed.
Ansil849 over 3 years ago
> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.

I found this to be an extremely deceptive conclusion. It makes it sound like the issue was that Facebook's physical security is just too gosh-darn good. But the issue was not Facebook's data-center physical security protocols. The issue was glossed over in the middle of the blog post:

> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.

The issue was faulty audit code. It is disingenuous to then attempt to spin this as though the downtime was due to Facebook's amazing physec protocols.
rblion over 3 years ago
Was this a way to delete a lot of evidence before shit really hit the fan?

After reading this, I can't help but feel this was a calculated move.

It gives FB a chance to hijack media attention from the whistleblower. It gives them a chance to show the average person: "hey, we make mistakes and we have a review process to improve our systems."

The timing is too perfect, if you ask me.
rvnx over 3 years ago
It's a no-apologies message:

"We failed, our processes failed, our recovery process only partially worked, we celebrate failure. Our investors were not happy, our users were not happy, some people probably ended up in physically dangerous situations due to WhatsApp being unavailable, but it's OK. We believe a tradeoff like this is worth it."

— Your engineering team.