> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.<p>I'm so glad to see that they framed this in terms of a bug in a tool designed to prevent human error, rather than simply blaming it on human error.
Interesting bit on recovery w.r.t. the electrical grid<p>> flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems ...<p>I wish there was a bit more detail in here. What's the worst case there? Brownouts, exploding transformers? Or less catastrophic?
Incidentally the facebook app itself really handled this gracefully. When the app can't connect to facebook, it displays "updates" from a pool of cached content. It looks and feels like facebook is there, but we know it's not. I didn't notice this until the outage and I thought it was neat.
Note that contrary to popular reports, DNS was NOT to blame for this outage — for once DNS worked exactly as per the spec, design and configuration:<p>> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.
> One of the jobs performed by our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, enabling the simple web names we type into browsers to be translated into specific server IP addresses.<p>What is the target audience of this post? It is too technical for non-technical people, but it is also dumbed down to try to include people who do not know how the internet works. I feel like I'm missing something.
You can see the security/reliability tradeoff problem here.<p>You need a control plane. But what does it run over? Your regular data links? A problem if it also controls their configuration. Something outside your own infrastructure, like a modest connection to the local ISP as a backup? That's an attack vector.<p>One popular solution is to keep both the current and previous generation of the control plane up. Both Google and AT&T seem to have done that. AT&T kept Signalling System 5 up for years after SS7 was doing all the work. Having two totally different technologies with somewhat different paths is helpful.
I'm confused by this. Are the DNS servers inside the backbone, or outside?<p><pre><code> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.</code></pre>
Google had a comparable outage several years ago.<p><a href="https://status.cloud.google.com/incident/cloud-networking/19009" rel="nofollow">https://status.cloud.google.com/incident/cloud-networking/19...</a><p>This event left a lot of scar tissue across all of Technical Infrastructure, and the next few months were not a fun time (e.g. a mandatory training where leadership read out emails from customers telling us how we let them down and lost their trust).<p>I'd be curious to see what systemic changes happen at FB as a result, if any.
Yesterday's post from Facebook<p><a href="https://engineering.fb.com/2021/10/04/networking-traffic/outage/" rel="nofollow">https://engineering.fb.com/2021/10/04/networking-traffic/out...</a>
For a lot of people in countries outside the US, Facebook _is_ the internet. Facebook has cut deals with various ISPs outside the US to allow people to use their services without it costing any data. Facebook going down is a mild annoyance for us but a huge detriment to, say, Latin America.
<p><pre><code> the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them.
</code></pre>
Sounds like it was the perfect storm.
> The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers.<p>This makes it sound like Facebook has physically laid "tens of thousands of miles of fiber-optic cables crossing the globe and linking all our data centers". Is this in fact true?
"Those translation queries are answered by our authoritative name servers that occupy well known IP addresses themselves, which in turn are advertised to the rest of the internet via another protocol called the border gateway protocol (BGP)."<p>"To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection."<p>Correct me if I am wrong, but here "DNS servers" means the computers, not the software running on them, i.e., each computer is running both DNS software and a BGP daemon. I am not aware of DNS server software that disables BGP advertisements but a BGP daemon could do it.<p>For example, a BGP daemon like ExaBGP can execute a DNS query, check the output and disable advertisements if the query fails.<p><a href="https://github.com/Exa-Networks/exabgp" rel="nofollow">https://github.com/Exa-Networks/exabgp</a>
DNS seems to be a massive point of failure everywhere, even taking out the tools to help deal with outages themselves. The same thing happened to Azure multiple times in the past, causing complete service outages. Surely there must be some way to better mitigate DNS misconfiguration by now, given the exceptional importance of DNS?
Apparently they had to bring in the angle grinder to get access to the server room.<p><a href="https://twitter.com/cullend/status/1445156376934862848?t=P5ua0Fk7iPT5g-05bgaA8w&s=19" rel="nofollow">https://twitter.com/cullend/status/1445156376934862848?t=P5u...</a>
<i>> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this.</i><p>If you correctly design your security with appropriate fallbacks, you don't need to make this tradeoff.<p>If that story of the Facebook campus having no physical keyholes on doors is true, it just speaks to an arrogance of assuming things can never fail so we don't even need to bother planning for it.
> Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.<p>Interesting that they think fluctuations of tens of megawatts would risk electrical systems. If the equipment was handling that much continuous load, wouldn't it also easily handle the resumption of the same load? Also I totally did not understand how power usage would affect caches.
> To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection.<p>No, it's (clearly) not a guaranteed indication of that. Logic fail. Infrastructure tools at that scale need to handle all possible causes of test failures. "Is the internet down or only the few sites I'm testing?" is a classic network monitoring script issue.
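<p>A toy sketch (Python; every probe target here is illustrative, not anything Facebook actually checks) of the distinction being argued for: before concluding "the network is unhealthy, withdraw everything", compare internal probes against independent external references so a failed check can be attributed to the right side.<p><pre><code>import socket

# Hypothetical targets for illustration only.
INTERNAL_TARGETS = ["dc1.internal.example", "dc2.internal.example"]  # backbone endpoints
EXTERNAL_REFERENCES = [("1.1.1.1", 53), ("8.8.8.8", 53)]             # public resolvers, independent of our backbone

def tcp_ok(host, port=443, timeout=2.0):
    """Best-effort reachability probe: can we complete a TCP handshake?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def assess():
    internal_up = any(tcp_ok(h) for h in INTERNAL_TARGETS)
    external_up = any(tcp_ok(h, p) for h, p in EXTERNAL_REFERENCES)
    if internal_up:
        return "healthy"              # backbone reachable: keep advertising
    if not external_up:
        return "probe-isolated"       # we can see nothing at all: distrust this result, take no action
    return "backbone-unreachable"     # internal down, internet fine: degrade or escalate gradually
                                      # instead of withdrawing every advertisement at once

print(assess())
</code></pre>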
When facebook is directly peering with so many other ASes, why would they not have static routes in place for those direct links? Why run BGP for that? It's not like there is going to be a better route than the direct link. If the link goes down, then you can rely on BGP to reroute.
tl;dr: a maintenance query was issued that inexplicably severed FB's data centers from the internet, which unnecessarily caused their DNS servers to mark themselves defunct, which made it all but impossible for their guys to repair the problem from HQ, which compelled them to physically dispatch field units whose progress was stymied by recently increased physical security measures.
> During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally<p>Imagine being this person.<p>Tomorrow on /r/tifu.
I am curious about something.<p>It has been quite a while since I had any job that required me to think about DCs.<p>Back in the day we would have a setup of regular modems.<p>If all hell broke loose, then we could dial up a modem and have access to the system. It was not fast, and it was a pain, but we could get access that way.<p>(I am skipping a lot of steps. There was heavy security involved in getting access.)<p>I guess landlines might not be an option anymore??
I don't know, the -f in rm -rf isn't a bug xD. I feel sorry for the poor engineer who fat-fingered the command. It definitely highlights an anti-pattern in the command line, but the fact that that single console had the power to affect the entire network highlights an "interesting" design choice indeed.
> We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.<p>I found this to be an extremely deceptive conclusion. This makes it sound like the issue was that Facebook's physical security is just too gosh darn good. But the issue was not Facebook's data center physical security protocols. The issue was glossed over in the middle of the blogpost:<p>> Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool didn’t properly stop the command.<p>The issue was faulty audit code. It is disingenuous to then attempt to spin this like the downtime was due to Facebook's amazing physec protocols.
Was this a way to delete a lot of evidence before shit really hit the fan?<p>After reading this, I can't help but feel this was a calculated move.<p>It gives FB a chance to hijack media attention from the whistleblower. It gives them a chance to show the average person, 'hey, we make mistakes and we have a review process to improve our systems'.<p>The timing is too perfect if you ask me.
It's a no-apologies message:<p>"We failed, our processes failed, our recovery process only partially worked, we celebrate failure. Our investors were not happy, our users were not happy, some people probably ended up in physically dangerous situations due to WhatsApp being unavailable, but it's ok. We believe a tradeoff like this is worth it."<p>- Your engineering team.