TechEcho
Major data center power failure (again): Cloudflare Code Orange tested

214 points by gmemstr about 1 year ago

8 comments

decasia about 1 year ago
As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.

Also, quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.
qmarchi about 1 year ago
Two major outages less than half a year apart, but with wildly different outcomes. It's definitely showing their engineering capabilities were targeted at the correct outcomes.

Would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 on the stack.
mannyv about 1 year ago
Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.

It reminds me of the Amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.
Terretta about 1 year ago
> "When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."

Background note to HN readers:

Almost *zero* SaaS providers (or even CDNs) using the term "our datacenter" or showing their datacenters on maps etc. have their own datacenters. It's universal and normal. In general they have a server, a rack, a cage, in shared space, subject to others' policies and practices, and their neighbors.

This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.
llbeansandrice about 1 year ago
A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast than the toy ones I've created.
alberth about 1 year ago
Single Point of Failure

Is PDX still a single point of failure for Cloudflare services?

It was 5 months ago [0], and if I understand the post, it sounds like it still is.

If anyone knows, I'd be curious to hear.

[0] https://news.ycombinator.com/item?id=38113503
andrewaylett about 1 year ago
I can very definitely empathise with the experience of having worked hard at fixing the issues underpinning high priority incidents, then noticing that what previously would have taken hours to fix is now only visible as a blip on a graph.
Jamie9912 about 1 year ago
Interesting, I didn't even hear about that second outage