科技回声 (Tech Echo)

8 comments

decasia, about 1 year ago

As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.

Also, it's quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.

qmarchi, about 1 year ago

Two major outages less than half a year apart, but with wildly different outcomes. It definitely shows their engineering capabilities were targeted at the correct outcomes.

I would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 of the stack.

mannyv, about 1 year ago

Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.

It reminds me of the Amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.

Terretta, about 1 year ago

> "When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."

Background note to HN readers:

Almost zero SaaS providers (or even CDNs) using the term "our datacenter" or showing their datacenters on maps etc. have their own datacenters. It's universal and normal. In general they have a server, a rack, a cage, in shared space, subject to others' policies and practices, and their neighbors.

This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.

llbeansandrice, about 1 year ago

A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast than the toy ones I've created.

alberth, about 1 year ago

Single Point of Failure

Is PDX still a single point of failure for Cloudflare services?

It was 5 months ago [0], and if I understand the post, it sounds like it still is.

If anyone knows, I'd be curious to hear.

[0] https://news.ycombinator.com/item?id=38113503

andrewaylett, about 1 year ago

I can very definitely empathise with the experience of having worked hard at fixing the issues underpinning high-priority incidents, then noticing that what previously would have taken hours to fix is now only visible as a blip on a graph.

Jamie9912, about 1 year ago

Interesting, I didn't even hear about that second outage.

Major data center power failure (again): Cloudflare Code Orange tested
