As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.<p>Also — quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.
Two major outages less than half a year apart, but with wildly different outcomes. It definitely shows their engineering efforts were targeted at the right outcomes.<p>Would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 of the stack.
Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.<p>It reminds me of the amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.
> <i>"When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."</i><p>Background note to HN readers:<p>Almost <i>zero</i> SaaS providers (or even CDNs) using the term "our datacenter" or showing their datacenters on maps etc. have their own datacenters. It's universal and normal. In general they have a server, a rack, a cage, in shared space, subject to others' policies and practices, and their neighbors.<p>This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.
A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast from the toy ones I've created.
Single Point of Failure<p>Is PDX still a single point of failure for Cloudflare services?<p>It was 5 months ago [0], and if I understand the post, it sounds like it still is.<p>If anyone knows, I'd be curious to hear.<p>[0] <a href="https://news.ycombinator.com/item?id=38113503">https://news.ycombinator.com/item?id=38113503</a>
I can definitely empathise with the experience of having worked hard to fix the issues underpinning high-priority incidents, then noticing that what would previously have taken hours to fix now shows up as just a blip on a graph.