As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.<p>Also — quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.
Two major outages less than half a year apart, but with wildly different outcomes. It definitely shows their engineering efforts were targeted at the right outcomes.<p>Would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 of the stack.
Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.<p>It reminds me of the amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.
> <i>"When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."</i><p>Background note to HN readers:<p>Almost <i>zero</i> SaaS providers (or even CDNs) using the term "our datacenter" or showing their datacenters on maps etc. have their own datacenters. It's universal and normal. In general they have a server, a rack, a cage, in shared space, subject to others' policies and practices, and their neighbors.<p>This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.
A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast from the toy ones I've created.
Single Point of Failure<p>Is PDX still a single point of failure for Cloudflare services?<p>It was 5 months ago [0], and if I understand the post, it sounds like it still is.<p>If anyone knows, I'd be curious to hear.<p>[0] <a href="https://news.ycombinator.com/item?id=38113503">https://news.ycombinator.com/item?id=38113503</a>
I can definitely empathise with the experience of having worked hard to fix the issues underpinning high-priority incidents, then noticing that what would previously have taken hours to fix now shows up as just a blip on a graph.