So their actual deployment process is quite rigorous and should have a tight blast radius. After lots of emulated and canary testing, their deployments are phased in over weeks. I don't see how a bad push could have done what happened yesterday.<p>I found a paper that describes the process in detail. See pages 10-11:<p><a href="https://web.archive.org/web/20211005034928/https://research.fb.com/wp-content/uploads/2021/03/Running-BGP-in-Data-Centers-at-Scale_final.pdf" rel="nofollow">https://web.archive.org/web/20211005034928/https://research....</a><p>Phase Specification<p>P1 Small number of RSWs in a random DC<p>P2 Small number of RSWs (> P1) in another random DC<p>P3 Small fraction of switches in all tiers in DC serving web traffic<p>P4 10% of switches across DCs (to account for site differences)<p>P5 20% of switches across DCs<p>P6 Global push to all switches<p>We classify upgrades in two classes: disruptive and non-disruptive, depending on if the upgrade affects existing forwarding state on the switch. Most upgrades
in the data center are non-disruptive (performance optimizations, integration with other systems, etc.). To minimize routing instabilities during non-disruptive upgrades, we use BGP
graceful restart (GR) [8]. When a switch is being upgraded,
GR ensures that its peers do not delete existing routes for a period of time during which the switch’s BGP agent/config is
upgraded. The switch then comes up, re-establishes the sessions with its peers and re-advertises routes. Since the upgrade
is non-disruptive, the peers’ forwarding state are unchanged.<p>Without GR, the peers would think the switch is down, and
withdraw routes through that switch, only to re-advertise them
when the switch comes back up after the upgrade.
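<p>To make the GR behaviour concrete, here is a toy sketch of my own (not from the paper). It only counts how many route removals and installs a peer performs while its neighbor restarts. The `Peer` class, the 120-second `restart_time`, and the example prefixes are all assumptions for illustration; real BGP graceful restart has more moving parts.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Toy view of the routes one peer learned from a neighboring switch."""
    graceful_restart: bool
    restart_time: int = 120                      # seconds; assumed value
    routes: set = field(default_factory=set)
    stale: bool = False
    churn: int = 0                               # route removals + installs

    def neighbor_session_lost(self):
        if self.graceful_restart:
            # Keep forwarding on the existing routes, but mark them stale.
            self.stale = True
        else:
            # Without GR the peer removes every route through that neighbor.
            self.churn += len(self.routes)
            self.routes = set()

    def neighbor_back(self, advertised, elapsed):
        if self.stale and elapsed > self.restart_time:
            # The restart timer expired first, so stale routes were flushed.
            self.churn += len(self.routes)
            self.routes = set()
        # Only genuine differences count as churn.
        self.churn += len(self.routes ^ advertised)
        self.routes = set(advertised)
        self.stale = False


def simulate_upgrade(graceful_restart, elapsed,
                     routes=("10.0.0.0/24", "10.0.1.0/24")):
    """Restart a neighbor that re-advertises the same routes; return churn."""
    peer = Peer(graceful_restart=graceful_restart, routes=set(routes))
    peer.neighbor_session_lost()
    peer.neighbor_back(set(routes), elapsed)
    return peer.churn
```

<p>Under these assumptions, `simulate_upgrade(True, 30)` returns 0 (nothing changes on the peer), while `simulate_upgrade(False, 30)` returns 4 (two removals followed by two re-installs), which is the instability GR is meant to avoid.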
<p>Disruptive upgrades (e.g., changes in policy affecting existing switch forwarding state) would trigger new advertisements/withdrawals to switches, and BGP re-convergence
would occur subsequently. During this period, production traffic could be dropped or take longer paths causing increased
latencies. Thus, if the binary or configuration change is disruptive, we drain (§3) and upgrade the device without impacting production traffic. Draining a device entails moving
production traffic away from the device and reducing effective
capacity in the network. Thus, we pool disruptive changes
and upgrade the drained device at once instead of draining
the device for each individual upgrade.
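<p>The pooling idea above can be sketched roughly like this. This is my own illustration, not the paper's push engine: the `Change` type, the step names (`graceful_restart_upgrade`, `drain`, `upgrade`, `undrain`), and applying non-disruptive changes one at a time are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    name: str
    disruptive: bool   # True if it alters existing forwarding state


def plan_upgrade(changes):
    """Return an ordered list of (step, change_names) for one switch."""
    steps = []
    disruptive = [c.name for c in changes if c.disruptive]
    non_disruptive = [c.name for c in changes if not c.disruptive]

    # Non-disruptive changes go in place, relying on graceful restart.
    for name in non_disruptive:
        steps.append(("graceful_restart_upgrade", [name]))

    # Disruptive changes are pooled so the device is drained only once.
    if disruptive:
        steps.append(("drain", []))
        steps.append(("upgrade", disruptive))
        steps.append(("undrain", []))
    return steps
```

<p>For example, one non-disruptive change plus two disruptive ones yields a single drain/undrain pair with both disruptive changes applied in between, rather than two separate drains.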
<p>Push Phases. Our push plan comprises six phases P1-P6 performed sequentially to apply the upgrades to agent/config in
production gradually.<p>We describe the specification of the 6
phases in Table 4. In each phase, the push engine randomly
selects a certain number of switches based on the phase’s
specification. After selection, the push engine upgrades these
switches and restarts BGP on these switches. Our 6 push
phases are to progressively increase scope of deployment with
the last phase being the global push to all switches. P1-P5 can
be construed as extensive testing phases: P1 and P2 modify
a small number of rack switches to start the push. P3 is our
first major deployment phase to all tiers in the topology. We
choose a single data center which serves web traffic because
our web applications have provisions such as load balancing
to mitigate failures. Thus, failures in P3 have less impact
to our services. To assess if our upgrade is safe in more diverse settings, P4 and P5 upgrade a significant fraction of our
switches across different data center regions which serve different kinds of traffic workloads. Even if catastrophic outages
occur during P4 or P5, we would still be able to achieve high performance connectivity due to the in-built redundancy in the
network topology and our backup path policies—switches running the stable BGP agent/config would re-converge quickly
to reduce impact of the outage. Finally, in P6, we upgrade the
rest of the switches in all data centers.<p>Figure 7 shows the timeline of push releases
over a 12 month period. We achieved 9 successful pushes of
our BGP agent to production. On average, each push takes
2-3 weeks.
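<p>For anyone who wants to see the six-phase selection from the quoted section spelled out, here is a rough sketch of my own. It is not the paper's push engine. The `Switch` type, the `small` and `p3_fraction` parameters, and computing the 10%/20% against the total fleet are assumptions; the paper only says "small number" and "small fraction". Health checks between phases are not modeled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    dc: str
    tier: str      # e.g. "RSW" for a rack switch


def pick(rng, pool, n):
    """Randomly pick up to n switches from pool."""
    return rng.sample(pool, min(n, len(pool)))


def plan_push(switches, web_dc, rng, small=2, p3_fraction=0.05):
    """Return {phase: [switch names]} covering every switch exactly once."""
    upgraded, plan = set(), {}
    dc1, dc2 = rng.sample(sorted({s.dc for s in switches}), 2)

    def remaining(pred=lambda s: True):
        return [s for s in switches if s.name not in upgraded and pred(s)]

    def commit(phase, chosen):
        plan[phase] = [s.name for s in chosen]
        upgraded.update(plan[phase])

    # P1/P2: a few rack switches in two different random DCs (P2 > P1).
    commit("P1", pick(rng, remaining(lambda s: s.dc == dc1 and s.tier == "RSW"), small))
    commit("P2", pick(rng, remaining(lambda s: s.dc == dc2 and s.tier == "RSW"), small + 1))

    # P3: a small fraction of every tier in the DC serving web traffic.
    chosen = []
    for tier in sorted({s.tier for s in switches}):
        pool = remaining(lambda s: s.dc == web_dc and s.tier == tier)
        chosen += pick(rng, pool, max(1, round(p3_fraction * len(pool))))
    commit("P3", chosen)

    # P4/P5: 10% then 20% of the fleet, drawn across all DCs.
    commit("P4", pick(rng, remaining(), round(0.10 * len(switches))))
    commit("P5", pick(rng, remaining(), round(0.20 * len(switches))))

    # P6: global push to everything that is left.
    commit("P6", remaining())
    return plan
```

<p>Calling `plan_push(switches, "dc-web", random.Random(0))` on a fleet spanning at least two DCs gives a reproducible plan in which each switch appears in exactly one phase.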