Nov 16 GCP Load Balancing Incident Report

172 points by joshma over 3 years ago

12 comments

darkwater over 3 years ago
"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."

This reminds everyone that even the top-notch engineers who work at Google are still human. A bugfix that didn't really fix the bug is one of the most human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, but I feel a bit better about myself today.
throwoutway over 3 years ago
Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.

I'm not good with statistics, but what are the chances?
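A rough back-of-the-envelope estimate, assuming (purely for illustration) that the race condition was equally likely to trigger in any given 30-minute window across the six months it existed:

    # Back-of-the-envelope only: assumes a uniform trigger probability,
    # which a load- and config-dependent race condition almost certainly
    # does not have.
    window_minutes = 30
    exposure_minutes = 6 * 30 * 24 * 60   # roughly six months of exposure
    p = window_minutes / exposure_minutes
    print(f"p ~= {p:.2e}  (about 1 in {round(1/p):,})")
    # p ~= 1.16e-04  (about 1 in 8,640)

A race condition tied to configuration pushes and traffic load is nowhere near uniform, so treat this number as an intuition pump rather than a real probability.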
htrp over 3 years ago
Did Roblox ever release the incident report from their outage?
chairmanwow1 over 3 years ago
Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.

My company wasn't affected, so I wasn't paying close attention to it. I was surprised to read that services were unreachable for only ~90 minutes.

Anyone else have corroborating anecdata?
bullen over 3 years ago
This is my experience of the outage: my DNS servers stopped working, but HTTP was operational if I used the IP, so something is rotten with this report.

Lesson learned: I will switch to AWS in Asia and only use GCP in central US, with GCP as backup in Asia and IONOS in central US.

Europe is a non-issue for hosting because it's where I live and services are plentiful.

I'm going to pay for a fixed IP on the fiber I can get that on, and host the first DNS on my own hardware with lead-acid backup.

Enough of this external dependency crap!
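As an aside, the symptom described here (names failing to resolve while the service still answers on its IP) is straightforward to check for. A minimal sketch, using a placeholder hostname and a previously recorded address rather than anything from the incident report:

    # Minimal sketch: is name resolution broken, or is the service itself down?
    # HOST and KNOWN_IP are placeholders, not values from the incident.
    import socket
    import urllib.request

    HOST = "example.com"
    KNOWN_IP = "203.0.113.10"

    try:
        socket.getaddrinfo(HOST, 80)
        print("DNS resolution OK")
    except socket.gaierror as exc:
        print(f"DNS resolution failed: {exc}")

    # Bypass DNS: connect to the known IP and send the Host header,
    # roughly what `curl --resolve` does.
    req = urllib.request.Request(f"http://{KNOWN_IP}/", headers={"Host": HOST})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(f"HTTP via raw IP OK: {resp.status}")
    except OSError as exc:
        print(f"HTTP via raw IP failed: {exc}")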
breakingcups over 3 years ago
What I would not give for a comprehensive leak of Google's major internal post-mortems.
gigatexal over 3 years ago
I find the post mortem really humanizing. As a customer of GCP there’s no love lost on my end.
londons_explore over 3 years ago
This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.
stevefan1999 over 3 years ago
one bug fixed, two bugs introduced...
m0zg over 3 years ago
> customers affected by the outage _may have_ encountered 404 errors

> for the inconvenience this service outage _may have_ caused

Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage (if not how many exactly) of the requests were 404s and for which customers. Why the weasel language? Own it.
SteveNuts over 3 years ago
Is there any possibility that data POSTed during that outage could have leaked something pretty sensitive?

For example, I enter my credit card info on Etsy prior to the issue, and just as I hit send, the payload now gets sent to Google?

At that scale there have to be many examples of similar issues, no?
londons_explore over 3 years ago
To me this shows Google doesn't have sufficient monitoring in place to know the *scale* of a problem and to calibrate the scale of the response.

For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, taking perhaps 15 minutes (on top of diagnosis and response times).

Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.

Obviously the latter puts a really large load on all the surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
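A sketch of that idea in code: a hypothetical remediation picker keyed off the measured error rate. The thresholds and strategy names are invented for illustration and are not Google's actual tooling:

    # Hypothetical sketch of scale-aware remediation, per the comment above.
    # Thresholds and strategy names are illustrative assumptions.
    def pick_remediation(error_fraction: float) -> str:
        """Map the measured fraction of failing requests to a response."""
        if error_fraction >= 0.50:
            # Near-total outage: a ~15 s nuke-and-restart rollback wins.
            return "global-rollback-and-restart"
        if error_fraction >= 0.01:
            # Partial outage: a ~15 min rolling restart limits blast radius.
            return "rolling-restart"
        # Corner case: drain or restart only the affected tasks.
        return "targeted-drain"

    for rate in (1.0, 0.05, 0.001):
        print(rate, "->", pick_remediation(rate))

The hard part the comment points at is the input, not the branch: you need monitoring trustworthy enough to feed `error_fraction` within the first few minutes of an incident.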