Major data center power failure (again): Cloudflare Code Orange tested

214 points by gmemstr, about 1 year ago

8 comments

decasia, about 1 year ago
As always, it's really impressive to see how much technical detail they release publicly in their RCAs. It sets a good example for the industry.

Also, it is quite impressive to make major infrastructure and architecture changes in a few months. Not every organization can pull that off.
qmarchi, about 1 year ago
Two major outages less than half a year apart, but with wildly different outcomes. It definitely shows that their engineering effort was aimed at the right outcomes.

I would definitely be interested to see the detailed RCA on the power side of things. Not many people really think about Layer 0 of the stack.
mannyv, about 1 year ago
Someone set up the breakers incorrectly way back when, and they were never adjusted. I'll bet it's not possible to adjust those without powering off the downstream equipment.

It reminds me of the Amazon guy discovering that there was no way to fail back power without an outage, then them going off and building their own equipment.
Terretta, about 1 year ago
> "When one or more of these breakers tripped, a cascading failure of the remaining active CSB boards resulted, thus causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure."

Background note to HN readers:

Almost *zero* SaaS providers (or even CDNs) that use the term "our datacenter" or show their datacenters on maps actually have their own datacenters. This is universal and normal. In general they have a server, a rack, or a cage in shared space, subject to others' policies and practices, and to their neighbors.

This can adjust your mental model for accountability and your designs for resilience. You can even exploit this by colo-ing at the same addresses to get LAN latencies to your SaaS provider, CDN, or (sometimes) even cloud provider.
llbeansandrice, about 1 year ago
A single k8s cluster spanning multiple datacenters feels mind-boggling to me. I know it's not exactly uncommon for HA, even if you just have a little one in your cloud provider of choice, but I'm sure it's a totally different beast than the toy ones I've created.
alberth, about 1 year ago
Single Point of Failure

Is PDX still a single point of failure for Cloudflare services?

It was 5 months ago [0], and if I understand the post, it sounds like it still is.

If anyone knows, I'd be curious to hear.

[0] https://news.ycombinator.com/item?id=38113503
andrewaylett, about 1 year ago
I can very definitely empathise with the experience of having worked hard at fixing the issues underpinning high priority incidents, then noticing that what previously would have taken hours to fix is now only visible as a blip on a graph.
Jamie9912, about 1 year ago
Interesting, I didn't even hear about that second outage.