
Nov 16 GCP Load Balancing Incident Report

172 points, by joshma, over 3 years ago

12 comments

darkwater, over 3 years ago
"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."

This reminds everyone that even the top-notch engineers who work at Google are still human. A bugfix that didn't really fix the bug is one of the most human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.
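A minimal illustration of the failure mode quoted above, using made-up config fields rather than anything from GCP: a validator patched against the error form seen in testing can still accept a differently malformed configuration.

```python
# Hypothetical illustration, not Google's code: "patch B" rejects the malformation
# observed during testing, but never inspects the one the race actually produces.
def validate_backend_config(config: dict) -> bool:
    # Patch B analogue: reject the error form seen in testing (missing backend list).
    if not config.get("backends"):
        return False
    # The race condition produced a *different* malformation -- backends present
    # but with an unusable port -- which this check never looks at.
    return True

racy_config = {"backends": [{"host": "10.0.0.1", "port": None}]}
print(validate_backend_config(racy_config))  # True: the bad config is still accepted
```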
throwoutway, over 3 years ago
Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I'm not good with statistics, but what are the chances?
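A rough back-of-the-envelope answer, assuming the trigger was equally likely at any moment over the bug's roughly six-month lifetime:

```python
# Back-of-the-envelope answer to "what are the chances?", assuming the trigger
# was equally likely at any moment over the bug's ~6-month lifetime.
bug_lifetime_min = 6 * 30 * 24 * 60   # ~six months, in minutes
window_min = 30                        # the last 30 minutes of the rollout
p = window_min / bug_lifetime_min
print(f"about 1 in {round(1 / p):,}")  # roughly 1 in 8,640
```

In practice the rollout activity itself may have made the race more likely to fire, which would make the timing less of a coincidence than this naive estimate suggests.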
htrp, over 3 years ago
Did Roblox ever release the incident report from their outage?
chairmanwow1, over 3 years ago
Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.

My company wasn't affected, so I wasn't paying close attention to it. I was surprised to read that services were unreachable for only ~90 minutes.

Anyone else have corroborating anecdata?
bullen, over 3 years ago
This is my experience of the outage: my DNS servers stopped working, but HTTP was operational if I used the IP, so something is rotten with this report.

Lesson learned: I will switch to AWS in Asia and use GCP only in the central US, with GCP as backup in Asia and IONOS in the central US.

Europe is a non-issue for hosting because it's where I live and services are plentiful.

I'm going to pay for a fixed IP on the fiber I can get that on, and host the first DNS on my own hardware with lead-acid backup.

Enough of this external dependency crap!
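A minimal sketch of the client-side workaround this observation implies: if DNS resolution fails, fall back to a pinned last-known-good IP and pass the original hostname in the Host header. The hostname, IP, and certificate handling below are placeholders, not real infrastructure.

```python
import socket
import requests

# Placeholders only: hostname, pinned IP, and path are not real infrastructure.
HOSTNAME = "api.example.com"
PINNED_IP = "203.0.113.10"   # last-known-good address, refreshed out of band

def fetch(path: str) -> requests.Response:
    try:
        socket.getaddrinfo(HOSTNAME, 443)  # is DNS still resolving?
        return requests.get(f"https://{HOSTNAME}{path}", timeout=5)
    except socket.gaierror:
        # DNS is down: talk to the pinned IP directly and send the original
        # hostname in the Host header so virtual hosting still routes the request.
        return requests.get(
            f"https://{PINNED_IP}{path}",
            headers={"Host": HOSTNAME},
            timeout=5,
            verify=False,  # the cert will not match a bare IP; acceptable only in a sketch
        )
```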
breakingcups, over 3 years ago
What I would not give for a comprehensive leak of Google's major internal post-mortems.
gigatexal, over 3 years ago
I find the post-mortem really humanizing. As a customer of GCP, there's no love lost on my end.
londons_explore, over 3 years ago
This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.
stevefan1999, over 3 years ago
one bug fixed, two bugs introduced...
m0zg, over 3 years ago
> customers affected by the outage _may have_ encountered 404 errors

> for the inconvenience this service outage _may have_ caused

Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage (if not exactly how many) of the requests were 404s and for which customers. Why the weasel language? Own it.
SteveNuts, over 3 years ago
Is there any possibility that data POSTed during that outage could have leaked some pretty sensitive data?

For example, I enter my credit card info on Etsy just before the issue, and just as I hit send, the payload now gets sent to Google?

At that scale there have to be many examples of similar issues, no?
londons_explore, over 3 years ago
To me this shows Google hasn't put in place sufficient monitoring to know the *scale* of a problem and the correct scale of response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, taking perhaps 15 minutes (on top of diagnosis and response times).

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter puts a really large load on all the surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
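A minimal sketch of that idea, with illustrative thresholds and timings that are not Google's actual policy: pick the rollback mechanism from the measured blast radius, so a total outage gets the fast destructive path while a small corner-case outage gets the gentler one.

```python
# Illustrative thresholds and timings only -- not Google's actual policy.
def choose_rollback(error_rate: float) -> str:
    if error_rate >= 0.5:
        # Near-total outage: the huge load of restarting everything at once is
        # worth it to shave the outage from ~25 minutes down to ~10.5.
        return "nuke-and-restart-everything (~30 s)"
    if error_rate >= 0.01:
        # Corner case affecting a small slice of users: a gentler rolling
        # restart avoids hammering the surrounding infrastructure.
        return "urgent rolling restart (~15 min)"
    return "targeted rollback / drain only the affected cells"

print(choose_rollback(1.00))  # 100% outage
print(choose_rollback(0.01))  # 1% corner case
```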