"Out of Band" network management is not trivial

117 points by DanAtC 10 months ago

18 comments

Animats 10 months ago

In the entire history of the Bell System, no electromechanical exchange was ever down for more than 30 minutes for any reason other than a natural disaster. With one exception, a major fire in New York City. Three weeks of downtime for 170,000 phones for that.[1] The Bell System pulled in resources and people from all over the system to replace and rewire several floors of equipment and cabling.

That record has not been maintained in the digital era.

The long distance system did not originally need the Bedminster, NJ network control center to operate. Bedminster sent routing updates periodically to the regional centers, but they could fall back to static routing if necessary. There was, by design, no single point of failure. Not even close. That was a basic design criterion in telecom prior to electronic switching. The system was designed to have less capacity but still keep running if parts of it went down.

[1] https://www.youtube.com/watch?v=f_AWAmGi-g8

Scoundreller 10 months ago

One thing that was fascinating about the Rogers outage was on the wireless side: because "just" the core was down, the towers were still up.

So mobile phones would connect to the tower just enough to register but not be able to do anything, like call 9-1-1, and would not try to fail over to other mobile networks. Devices showed zero bars, but field test mode would show some handshake succeeding.

(The CTO was roaming out-of-country, had zero bars and thought nothing of it... how they had no idea an enterprise-risking update was scheduled, we'll never know)

Supposedly you could remove your SIM card (who carries that tool doohickey with them at all times?), or disable that eSIM, but you'd have to know that you can do that. Unsure if you'd still be at the mercy of Rogers being the most powerful signal and still failing to get your 9-1-1 call through.

Rogers claimed to have no ability to power down the towers without a truck-roll (which is another aspect where widespread OOB could have come in handy).

Various stories of radio stations (which Rogers also owns a lot of) not being able to connect the studio to the transmitter, so some tech went with an mp3 player to play pre-recorded "evergreen" content. Others just went off-air.

https://www.theregister.com/2022/07/25/canadian_isp_rogers_outage/

gavindean90 10 months ago

I’m reminded of when an old AT&T building went on sale as a house, and one of its selling points was that you could get power from two different power companies if you wanted. This highlighted to me the level of redundancy required to take such things seriously. It probably cost the company a lot to hook up the wires, and I doubt the second power company paid anything for the hookup. Big Bell did it there, and I’m sure they did it everywhere else too.

Edit: I bet it had diesel generators when it was in service with AT&T to boot.

transcriptase 10 months ago

It’s trivial when you have the resources that come from being one of Canada’s 3 telecom oligopoly members.

Unfortunately the CRTC is run by former execs/management of Bell, Telus, and Rogers, and our anti-competition bureau doesn’t seem to understand its purpose when it consistently allows these 3 to buy up any and all small competitors that gain even a regional market share.

Meanwhile their service is mediocre and overpriced, which they’ll chalk up to geographical challenges of operating in Canada while all offering the exact same plans at the exact same prices, buying sports teams, and paying a reliable dividend.

1992spacemovie 10 months ago

There is OOB for carriers and OOB for non-carriers. OOB for carriers is significantly more complex and resource intensive than OOB for non-carriers. This topic (OOB or to forgo) has been beaten to death over the last 20 years in operator circles; the responsible consensus is that trying to shave a % off operating expenses by cheaping out on your OOB is wrong. That said, it does shock me that one of the tier-1 carriers in Canada was this... ignorant? Did they never expect it to rain or something? Wild.

goatsi 10 months ago

When I see out of band management at remote locations (usually for a dedicated doctors' network run by the health authority that gets deployed at offices and clinics) it's generally analog phone line -> modem -> console port. Dialup is more than enough if all you need to do is reset a router config.

Not 100% out of band for a telco though, unless they made sure to use a competitor's lines.

ChuckMcM 10 months ago

Reminds me of a data center that said they had a backup connection and I pointed out that only one fiber was coming into the data center. They said, "Oh it's on a different lambda[1]" :-)

[1] Wave division multiplexing sends multiple signals over the same fiber by using different wavelengths for different channels. Each wavelength is sometimes referred to as a lambda.
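
The "different lambda" story is a shared-fate problem, and a rough way to catch it is to compare the physical elements each link rides on rather than its logical channel. What follows is only a sketch: the `Link` type, the link names, and the fiber and conduit labels are all made up for illustration, and a real inventory would come from the carrier's route records.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Link:
    """A connection plus the physical elements it depends on (hypothetical model)."""
    name: str
    # Only physical elements are listed. Wavelengths are left out on purpose:
    # two lambdas on one fiber fail together when that fiber is cut.
    physical: FrozenSet[str]


def shared_fate(a: Link, b: Link) -> FrozenSet[str]:
    """Return the physical elements both links depend on (empty means diverse)."""
    return a.physical & b.physical


# Hypothetical inventory for illustration only.
primary = Link("primary", frozenset({"fiber-17", "conduit-east"}))
backup = Link("backup-lambda-2", frozenset({"fiber-17", "conduit-east"}))
diverse = Link("backup-other-carrier", frozenset({"fiber-42", "conduit-west"}))

if __name__ == "__main__":
    for candidate in (backup, diverse):
        common = shared_fate(primary, candidate)
        verdict = "shares " + ", ".join(sorted(common)) if common else "is physically diverse"
        print(f"{candidate.name} {verdict}")
```

An empty intersection does not prove diversity, since the inventory may be incomplete, but a non-empty one shows the "backup" is not one.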

knocknock 10 months ago

My previous org's OOB used a data-only SIM card from a different service provider. Curious why that wouldn't be a good solution?

ralferoo 10 months ago

From TFA:

> If your OOB network is your only way of managing things, you not only have to build a separate network, you have to make sure it is fully redundant, because otherwise you've created a single point of failure for (some) management.

I'm not sure I necessarily agree with that. You can set up the network in such a way that you can route over the main network as a backup if your OOB network was down but the main network was up. Obviously, it's not quite as simple as sticking a patch cable between the two networks, but it can be close - you have a machine that's always on your OOB network, and it has an additional port that either configures itself over DHCP or has a hard-coded IP for the main net. But the important thing is that you never have that patched in, except for emergencies like your OOB network cable being severed but you still have access to the main network. If that does happen, you plug it in temporarily and use that machine as a proxy. There's no real reason for extra redundancy in the OOB, because if your main uplink is also severed, there's not really much you're going to be usefully configuring anyway!
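
The fallback order described here (OOB first, the main network via the emergency-patched machine second, nothing otherwise) can be written down as a small decision function. This is a sketch under assumptions: the function names and the idea of probing with a plain TCP connect are illustrative, not something from the article or the Rogers report.

```python
import socket
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_management_path(
    oob_ok: Callable[[], bool],
    inband_ok: Callable[[], bool],
) -> Optional[str]:
    """Prefer the OOB path; fall back to the main network; None if both are down."""
    if oob_ok():
        return "oob"
    if inband_ok():
        # Only valid once someone has physically patched the proxy machine in.
        return "in-band via proxy"
    return None
```

In use, the two callables would wrap `tcp_probe` against addresses of your own choosing, for example `pick_management_path(lambda: tcp_probe(oob_host, 22), lambda: tcp_probe(proxy_host, 22))`, where `oob_host` and `proxy_host` are placeholders.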

kkfx 10 months ago

Apart from Rogers and the like, the main OOB/LOM issue is that it's mostly very old iron that very few people know. Finding people who know it, and finding solutions that aren't hacky, homegrown, and barely tested, is damn hard.

synack 10 months ago

With launch costs dropping, I wonder if there’s a market for a low bandwidth “ssh via satellite” service. Could use AWS Ground Station to connect to your VPC.

walterbell 10 months ago

> hardened in-band management

What would this look like in practice? Management interfaces like BSPs don't have a great security track record.

jeffrallen 10 months ago

I love ChrisO so much, and it's funny but often he's talking about something I'm currently working on too.

Thank you to Chris and to whoever posts his articles here.

TwoNineFive 10 months ago

The blog post is weird. "Rogers didn't even try, so OOB is hard."

Also this sentence makes me question his IQ:

"Some people have gone so far as to suggest that out of band network management is an obvious thing that everyone should have"

Yes Chris, Rogers, the monopoly telco company of Canada, should have an OOB network! They can afford it.

Talking about the challenges of OOB is great, but the point of the blog post is wrong and dumb.

The report says "Rogers had a management network that relied on the Rogers IP core network". They had no OOB network. They didn't even try.

This is a symptom of Rogers' status as a monopoly, negligence on the part of Rogers, and negligence on the part of the government, who should have regulated OOB into existence. This is some serious clown car shit.

One of the advantages that competitor networks provide is redundancy. Canada doesn't have that, so their networks will remain weak. This will probably happen again some day.

Yes OOB is hard, but not even trying and then throwing up your hands and defending the negligent is stupid.

ianpenney 10 months ago

Ham radio. Meshtastic. Knowing your neighborhood.

pharos92 10 months ago

I disagree, Out of Band Network Management (OOBM) is extremely trivial to implement. Most companies however don't see the value of OOBM until they have a major fault. The setup costs can be high, and the ongoing operational costs of OOBM infrastructure and links are also significant. I've built dozens of OOBM networks using fibre and 4G with the likes of Opengear. In some instances I deployed OOBM ahead of infrastructure rollouts so that hardware could be delivered to site directly from the factory, rather than going through a staging environment, which adds time, cost and complexity.

kjellsbells 10 months ago

I worry that this misses the point a little. All the OOB in the world will not help you if you cannot reach the management entity (e.g. IP-enabled PSU, terminal server, etc). It is also insufficient to protect against second order thundering-herd-type problems (e.g.: you log in, stop a worker process, and upstream, traffic is directed away from the node to the others, and starts causing new problems).

In telco operations, every MoP should have: an unambiguous linear sequence of steps, a procedure to verify that the desired result has been achieved, and a backout plan if things do go bad. This is drilled into you at every telco I ever worked at. Rogers' cardinal sin on the day of the outage was that they didn't have a backout plan at each step of the MoP.

More structurally, networks have a dependency graph that you ignore at your peril. X depends on Y, which depends on Z, and so on. And yes, loops are quite possible! OOB management is an attempt to add new links to the graph that only get used in a crisis. These kinds of pull-it-out-when-you-need-it solutions are fine, but have a tendency to fail just when you need them. For one, they don't get exercised enough, and two, they may have their own dependencies on the graph that are not realized until too late.

So, what would this Internet rando prescribe? First order of business is to enumerate the dependency graph. I would wager that BGP, DNS, and the identity system are at or near the very top. Notice the deadly embrace of DNS and ID: if DNS is down, ID fails.

Next, study the failure modes of the elements. In the Rogers outage, a lack of route filters crashed a core router. That's a vague word, "crashed". Are we talking core dumps and SEGVs? Are we talking response times that skyrocketed, leading to peers timing out? Rogers really need to understand that. Typically in telco networks when nodes get "congested" like this there are escape valves built into the control plane protocol, e.g. a response that says "please back off and retry in rand(300)". They need to have a conversation with Cisco/Juniper etc and their router gurus about this.

Finally, the telco industry (or what's left of it) needs to do some introspection about the direction it is pulling vendors. For the last 15 years, telcos have been convinced that if only they can ingest some of that sweet, sweet cloud juice, their software costs will drop, they can slash operations costs, and watch the share price go brrr. Problem is, replacing legacy systems with ones cobbled together by vendors from a patchwork of kubernetes and prayers is guaranteed not to lead to the level of reliability that telcos and their regulators expect. If I'm a Rogers operations manager and my network dies, I don't want to hear that some dude in India has to spend the next week picking through a service mesh and experimenting with multus to decide if turning it off and on again is gonna work.
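
Enumerating the dependency graph, as suggested above, lends itself to a short script: write down what each service needs, then look for loops and for crisis tools that quietly depend on the thing they are meant to rescue. The sketch below assumes a hand-maintained map; the service names and edges in `DEPENDS_ON` are invented for illustration and do not describe Rogers' actual network.

```python
from typing import Dict, List, Optional, Set

# Hypothetical map: each service lists what it needs in order to function.
DEPENDS_ON: Dict[str, List[str]] = {
    "oob_console": ["identity"],
    "identity": ["dns"],
    "dns": ["ip_core"],
    "ip_core": ["bgp"],
    "bgp": [],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency loop as a list of nodes, or None if there is none."""
    in_progress: Set[str] = set()   # nodes on the current DFS path
    done: Set[str] = set()          # nodes fully explored
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        in_progress.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            if dep in in_progress:
                # Found a back edge: slice the path from the first visit of dep.
                return path[path.index(dep):] + [dep]
            if dep not in done:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for start in graph:
        if start not in done:
            found = visit(start)
            if found:
                return found
    return None


def transitive_deps(graph: Dict[str, List[str]], start: str) -> Set[str]:
    """Return everything `start` depends on, directly or indirectly."""
    seen: Set[str] = set()
    todo = [start]
    while todo:
        for dep in graph.get(todo.pop(), []):
            if dep not in seen:
                seen.add(dep)
                todo.append(dep)
    return seen


if __name__ == "__main__":
    print("loop:", find_cycle(DEPENDS_ON))
    print("oob_console needs:", sorted(transitive_deps(DEPENDS_ON, "oob_console")))
```

In this made-up map there is no loop, but `transitive_deps` shows the OOB console ultimately depending on the IP core, which is the kind of hidden dependency that only surfaces in a crisis.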

bigcat12345678 10 months ago

Who said it is trivial?...

Edit: The article takes a title and describes some straightforward technical and business investments needed to make an OOB management network work.
