Update about the October 4th outage

274 points by ve55 over 3 years ago

39 comments

imgabe over 3 years ago

It just occurred to me to wonder if Facebook has a Twitter account and if they used it to update people about the outage. It turns out they do, and they did, which makes sense. Boy, it must have been galling to have to use a competing communication network to tell people that your network is down.

It looks like Zuckerberg doesn't have a personal Twitter though, nor does Jack Dorsey have a public Facebook page (or they're set not to show up in search).

geerlingguy over 3 years ago

Gotta love how painfully vague this is. Sounds like a PR piece for investors, not an engineering blog piece.

stephenhuey over 3 years ago

Even though the angle grinder story wasn’t accurate, it’d still be interesting to know what percentage of the time to fix the outage was spent on regaining physical access: <a href="https://mobile.twitter.com/mikeisaac/status/1445196576956162050?s=21" rel="nofollow">https://mobile.twitter.com/mikeisaac/status/1445196576956162...</a>

go_prodev over 3 years ago

I worked with a network engineer who misconfigured a router that was connecting a bank to its DR site. The engineer had to drive across town to manually patch into the router to fix it.

DR downtime was about an hour, but the bank fired him anyway.

Given that Zuck lost a substantial amount of money, I wonder if the engineer faced any ramifications.

Sidenote: I asked the bank infrastructure team why the DR site was in the same earthquake zone, and they thought I was crazy. They said if there's an earthquake we'll have bigger problems to deal with.

gannon- over 3 years ago

This is a funny post to have suggested at the bottom of the article: <a href="https://engineering.fb.com/2021/08/09/connectivity/backbone-management/" rel="nofollow">https://engineering.fb.com/2021/08/09/connectivity/backbone-...</a>

lionkor over 3 years ago

So this is pure conspiracy theory, but to me this could be a security issue. What if something deep in the core of your infrastructure is compromised? Everything at risk? I'd ask my best engineer, he'd suggest shutting it down, and the best way to do that is to literally pull the plug on what makes you public. Tell everyone we accidentally messed up a BGP and that's it.

But yeah, likely not.

runawaybottle over 3 years ago

It was interesting to visit the subreddits of random countries (e.g. /r/Mongolia) and see the top posts all asking if fb/Insta/WhatsApp being down was local or global. I got the impression this morning that it was only affecting NA and Europe, but it looks like it was totally global. The number of people trying to log in must be staggering.

andrewxdiamond over 3 years ago

This more or less confirms what we’ve heard, and I appreciate the speed, but it’s incredibly lame from a details point of view.

Will a real postmortem follow? Or is this the best we are gonna get?

supermatt over 3 years ago

Sounds like they could do with some updates to their risk-driven backbone management strategy! <a href="https://engineering.fb.com/2021/08/09/connectivity/backbone-management/" rel="nofollow">https://engineering.fb.com/2021/08/09/connectivity/backbone-...</a>

raverbashing over 3 years ago

The badge story only shows how people are looking for "efficiency" where it doesn't matter, with predictable results.

The badge system should be local to the building. There are few actual reasons (sure, besides "efficiency") why badge control should be centralized. Even fewer reasons for it to be a subdomain of fb. Another option would be to keep the system but make it failsafe (but it seems the newer generation doesn't know what that means). If the network goes down, keep it at the last config. Badge validation should be offline first and added/removed ones should be broadcast periodically.

This is the same issue as with smartlocks, times the number of employees. Do you really want to add another point of failure between yourself and your home?
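
A minimal sketch of the offline-first idea, assuming each door controller keeps its own copy of the allowed badge list and receives periodic, versioned broadcasts of added and removed badges. The class name, method names, and update format are hypothetical and not taken from any real badge system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BadgeReader:
    """Hypothetical door controller that validates badges from a local cache."""

    allowed: set[str] = field(default_factory=set)
    version: int = 0  # version of the last broadcast that was applied

    def apply_update(self, version: int, added: set[str], removed: set[str]) -> bool:
        """Apply one periodic broadcast of added/removed badges.

        Stale or replayed broadcasts (version <= current) are ignored.
        Returns True if the update was applied.
        """
        if version <= self.version:
            return False
        self.allowed |= added
        self.allowed -= removed
        self.version = version
        return True

    def can_enter(self, badge_id: str) -> bool:
        """Decide locally; no network call happens at the door."""
        return badge_id in self.allowed


reader = BadgeReader()
reader.apply_update(1, added={"alice", "bob"}, removed=set())
reader.apply_update(2, added={"carol"}, removed={"bob"})

# If the network goes down at this point, no further updates arrive and the
# reader simply keeps answering from the last config it received.
print(reader.can_enter("alice"))  # True
print(reader.can_enter("bob"))    # False
```

The trade-off in this sketch is that a revoked badge keeps working until the next broadcast reaches the door, which is the price of not depending on the network for every swipe.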

crtasm over 3 years ago

"To all the people and businesses around the world who depend on us, " ... yesterday was another example of why you shouldn't depend on us to such an extent.

cheesecake_luvr over 3 years ago

On a side note: when I browse to that page in Firefox (92.0.1) from HN I can't go back to HN - the back arrow is disabled. What gives?

metissec98 over 3 years ago

Well that doesn't say a whole lot... I know it is early but they could use a little more detail. Even if it is just a timeline.

paxys over 3 years ago

It was quite ironic that while every Facebook property was offline there was an immense amount of misinformation about the incident perpetuated across the internet (including right here on HN) which everyone just believed as fact.

dev_tty01 over 3 years ago

> We also have no evidence that user data was compromised as a result of this downtime.

No, that just happens during uptime.

shahsyed over 3 years ago

> configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues

This could be anything, potentially.

I'm not very knowledgeable in computer networking, but this could be as trivial as an incorrect update to a DNS record, right?

niko001 over 3 years ago

It would be interesting to estimate what dollar value can be ascribed to an x-hour FB outage, both in terms of lost ad revenue for FB itself as well as missed conversions/revenue for businesses running ads on FB/IG.
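
A back-of-the-envelope way to frame that estimate, as a rough sketch only: it assumes revenue accrues evenly over the year and that some fraction of the lost spend may be recovered later. The function name, the `recovery_fraction` parameter, and the example figures are placeholders, not real Facebook numbers.

```python
def outage_cost(annual_revenue: float, outage_hours: float,
                recovery_fraction: float = 0.0) -> float:
    """Estimate revenue lost to an outage.

    Assumes revenue is spread evenly across the year (no peak hours) and
    that `recovery_fraction` of the lost spend comes back afterwards.
    """
    hours_per_year = 365 * 24
    hourly_revenue = annual_revenue / hours_per_year
    return hourly_revenue * outage_hours * (1.0 - recovery_fraction)


# Placeholder inputs: 8,760,000 per year works out to 1,000 per hour.
print(outage_cost(8_760_000, 6))                          # 6000.0
print(outage_cost(8_760_000, 6, recovery_fraction=0.25))  # 4500.0
```

The same function could be applied to an advertiser's own revenue to approximate missed conversions, though real traffic is not evenly distributed over the day.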

0xy over 3 years ago

Knowing almost nothing about networking, isn't the way Facebook handles networking somewhat of a monolithic anti-pattern? Why is a single update responsible for taking out multiple services, and why wouldn't each product, or even each region within each product, have its own routes for resiliency, which could then be used to roll out changes more slowly?

By having a large centralized and monolithic system, aren't they guaranteeing that mistakes cause huge splash damage and don't separate concerns?

advpetc over 3 years ago

Just out of curiosity, does Facebook have a status page? Like <a href="http://status.twitter.com" rel="nofollow">http://status.twitter.com</a>?

Jugurtha over 3 years ago

The first thing people here thought of was that it was the government denying access to these websites, as it usually does for a number of reasons.

dugo over 3 years ago

Around the turn of the century, in a network the size of Europe, we had OOB comms to the core routers via ISDN/POTS. We experimented with mobile phones in the racks as well, much to the chagrin of the old telco guys running the PoPs.

stormdennis over 3 years ago

The mobile whatsapp app should notify that the whatsapp servers are down and not allow you to just send messages that won't arrive for six hours

sydthrowaway over 3 years ago

Any FB throwaway know if someone got fired for this?

dr_hooo over 3 years ago

Why is this non-post on the front page? It's PR only.

wyldfire over 3 years ago

Move fast and

NO CARRIER

reilly3000 over 3 years ago

So their actual deployment process is quite rigorous and should have a tight blast radius. After lots of emulated and canary testing, their deployments are rolled out in phases over weeks. I don't see how a bad push could have done what happened yesterday.

I found a paper that describes the process in detail. See page 10-11: <a href="https://web.archive.org/web/20211005034928/https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf" rel="nofollow">https://web.archive.org/web/20211005034928/https://research....</a>

Phase: Specification
P1: Small number of RSWs in a random DC
P2: Small number of RSWs (> P1) in another random DC
P3: Small fraction of switches in all tiers in DC serving web traffic
P4: 10% of switches across DCs (to account for site differences)
P5: 20% of switches across DCs
P6: Global push to all switches

We classify upgrades in two classes: disruptive and non-disruptive, depending on if the upgrade affects existing forwarding state on the switch. Most upgrades in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP graceful restart (GR) [8]. When a switch is being upgraded, GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade is non-disruptive, the peers’ forwarding state are unchanged.

Without GR, the peers would think the switch is down, and withdraw routes through that switch, only to re-advertise them when the switch comes back up after the upgrade. Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving production traffic away from the device and reducing effective capacity in the network. Thus, we pool disruptive changes and upgrade the drained device at once instead of draining the device for each individual upgrade.

Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in production gradually.

We describe the specification of the 6 phases in Table 4. In each phase, the push engine randomly selects a certain number of switches based on the phase’s specification. After selection, the push engine upgrades these switches and restarts BGP on these switches. Our 6 push phases are to progressively increase scope of deployment with the last phase being the global push to all switches. P1-P5 can be construed as extensive testing phases: P1 and P2 modify a small number of rack switches to start the push. P3 is our first major deployment phase to all tiers in the topology.

We choose a single data center which serves web traffic because our web applications have provisions such as load balancing to mitigate failures. Thus, failures in P3 have less impact to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the network topology and our backup path policies - switches running the stable BGP agent/config would re-converge quickly to reduce impact of the outage. Finally, in P6, we upgrade the rest of the switches in all data centers.

Figure 7 shows the timeline of push releases over a 12 month period. We achieved 9 successful pushes of our BGP agent to production. On average, each push takes 2-3 weeks.
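
To make the phased idea concrete, here is a rough sketch of a push engine that upgrades a randomly chosen, growing share of switches and halts as soon as a phase looks unhealthy. It is not the paper's implementation: the cumulative percentages for P1-P3, the reading of P4/P5 as cumulative shares, and the `upgrade`/`healthy` callbacks are all assumptions, and draining, data center selection, and graceful restart are not modeled.

```python
import math
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Cumulative percentage of the fleet upgraded by the end of each phase.
# P4-P6 loosely follow the quoted table; P1-P3 values are placeholders.
PHASES = [("P1", 1), ("P2", 2), ("P3", 5), ("P4", 10), ("P5", 20), ("P6", 100)]


def phased_push(
    switches: Sequence[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Optional[str]]:
    """Upgrade switches phase by phase and stop at the first unhealthy phase.

    Returns the switches that were upgraded and the name of the phase that
    failed its health check (None if every phase passed).
    """
    rng = rng or random.Random()
    pending = list(switches)
    rng.shuffle(pending)  # random selection of switches within each phase
    done: List[str] = []
    for name, percent in PHASES:
        target = math.ceil(percent * len(switches) / 100)
        batch = pending[: max(0, target - len(done))]
        pending = pending[len(batch):]
        for switch in batch:
            upgrade(switch)
        done.extend(batch)
        if not all(healthy(switch) for switch in batch):
            return done, name  # halt here: later phases never run
    return done, None


fleet = [f"sw{i}" for i in range(100)]
upgraded, failed_phase = phased_push(fleet, upgrade=lambda s: None, healthy=lambda s: True)
print(len(upgraded), failed_phase)
```

The point of the structure is that a change that breaks switches in an early phase only ever reaches the small share of the fleet selected so far.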

r00tanon over 3 years ago

"Post hoc ergo propter hoc"

r00tanon over 3 years ago

Remember, remember, the 4th of October.

r00tanon over 3 years ago

Yes. It is true. If you enter Facebook into Facebook. It will break the internet.

Elyes-ghorbel over 3 years ago

Could you please be more clear about "no evidence that user data was compromised"?

herald67 over 3 years ago

Do you think DLT/blockchain could keep this from happening again in the future?

trthomps over 3 years ago

Reading this statement all I can think of is this scene <a href="https://www.youtube.com/watch?v=15HTd4Um1m4" rel="nofollow">https://www.youtube.com/watch?v=15HTd4Um1m4</a>

eyelidlessness over 3 years ago

One of the things they restored was annoying sounds in the app every time I tap anything. Who knew that was DNS related!

1970-01-01 over 3 years ago

TL;DR

We YOLO'd our BGP experiment to prod. It failed.

<a href="https://web.archive.org/web/20210626191032/https://engineering.fb.com/2021/05/13/data-center-engineering/bgp/" rel="nofollow">https://web.archive.org/web/20210626191032/https://engineeri...</a>

dave333 over 3 years ago

I thought DARPA designed the internet to survive nuclear war - no single point of failure - clearly Facebook's network breaks that rule. They need a DNS of last resort that doesn't update fast.

andy-x over 3 years ago

Such BS. FB imagines that they are their own Internet, but fails in the most miserable way because they need the actual Internet to communicate.

coliveira over 3 years ago

The best course of action is to split FB into separate companies. It is already neatly divided between Instagram, WU and legacy Facebook. That would be the best way for the government to avoid disruptions.

vishesh92 over 3 years ago

> We also have no evidence that user data was compromised as a result of this downtime.

I am not sure why they had to mention this specifically. This makes it sound like an external attack.

rvz over 3 years ago

It has been painfully admitted by the Facebook mafia that they know that they are the internet and farming the data of an entire civilisation; further evidence that this deep integration of their services needs to be broken up.

After all the scandals, leaks, whistleblowers, etc., it would take more than a DNS record wipe to take down the Facebook mafia.
