I worry that this misses the point a little. All the OOB in the world will not help you if you cannot reach the management entity (e.g. an IP-enabled PSU, a terminal server, etc.). It is also insufficient to protect against second-order, thundering-herd-type problems (e.g. you log in and stop a worker process, upstream traffic is directed away from that node to the others, and that starts causing new problems).<p>In telco operations, every MoP (method of procedure) should have: an unambiguous linear sequence of steps, a procedure to verify that the desired result has been achieved, and a backout plan if things do go bad. This is drilled into you at every telco I ever worked at. Rogers' cardinal sin on the day of the outage was that they didn't have a backout plan at each step of the MoP.<p>More structurally, networks have a dependency graph that you ignore at your peril. X depends on Y, which depends on Z, and so on. And yes, loops are quite possible! OOB management is an attempt to add new links to the graph that only get used in a crisis. These kinds of pull-it-out-when-you-need-it solutions are fine, but they have a tendency to fail just when you need them. For one, they don't get exercised enough, and for another, they may have their own dependencies on the graph that are not realized until too late.<p>So, what would this Internet rando prescribe? First order of business is to enumerate the dependency graph. I would wager that BGP, DNS, and the identity system are at or near the very top. Notice the deadly embrace of DNS and ID: if DNS is down, ID fails.<p>Next, study the failure modes of the elements. In the Rogers outage, a lack of route filters crashed a core router. That's a vague word, "crashed". Are we talking core dumps and SEGVs? Are we talking response times that skyrocketed, leading to peers timing out? Rogers really needs to understand that.
Typically in telco networks, when nodes get "congested" like this, there are escape valves built into the control plane protocol, e.g. a response that says "please back off and retry in rand(300)". They need to have a conversation with Cisco/Juniper etc. and their router gurus about this.<p>Finally, the telco industry (or what's left of it) needs to do some introspection about the direction it is pulling vendors. For the last 15 years, telcos have been convinced that if only they can ingest some of that sweet, sweet cloud juice, their software costs will drop, they can slash operations costs, and watch the share price go brrr. Problem is, replacing legacy systems with ones cobbled together by vendors from a patchwork of Kubernetes and prayers is guaranteed not to lead to the level of reliability that telcos and their regulators expect. If I'm a Rogers operations manager and my network dies, I don't want to hear that some dude in India has to spend the next week picking through a service mesh and experimenting with Multus to decide if turning it off and on again is gonna work.
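<p>To make the "enumerate the dependency graph" step concrete, here is a rough sketch of what I mean: write the dependencies down as data, then let a script find the loops and show what your OOB path quietly relies on. The node names and edges are made up for illustration (in particular, "DNS needs identity" is my assumption about how a deadly embrace might look), and this is not anyone's real topology.

```python
from typing import Dict, List, Set

# Hypothetical edges: "A": ["B"] means A needs B in order to function.
# Illustrative only; the dns -> identity edge is an assumption.
DEPS: Dict[str, List[str]] = {
    "oob_mgmt": ["identity"],
    "identity": ["dns"],
    "dns": ["bgp", "identity"],
    "bgp": [],
}


def find_cycles(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Return dependency loops found by a depth-first search."""
    cycles: List[List[str]] = []
    state: Dict[str, str] = {}  # node -> "visiting" or "done"
    path: List[str] = []

    def visit(node: str) -> None:
        state[node] = "visiting"
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep) == "visiting":
                # dep is already on the current path, so we closed a loop
                cycles.append(path[path.index(dep):] + [dep])
            elif dep not in state:
                visit(dep)
        path.pop()
        state[node] = "done"

    for node in deps:
        if node not in state:
            visit(node)
    return cycles


def transitive_deps(deps: Dict[str, List[str]], start: str) -> Set[str]:
    """Everything `start` needs, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(deps.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, []))
    return seen


if __name__ == "__main__":
    print("loops:", find_cycles(DEPS))
    print("OOB really depends on:", sorted(transitive_deps(DEPS, "oob_mgmt")))
```

With this toy data, the script reports the identity/DNS loop and shows that the supposedly independent OOB path still hangs off identity, DNS, and BGP. The hard part in real life is getting honest edges into the table, not the traversal.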