"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."<p>This reminds everyone that even the top-notch engineers that work at Google are still humans. A bugfix that didn't really fix the bug is one of the more human things that can happen.
I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.
Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.<p>I’m not good with statistics, but what are the chances?
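A back-of-the-envelope answer, under the (big) assumption that the trigger was equally likely to land in any 30-minute window across those 6 months:

    # Naive estimate: chance that a uniformly random 30-minute trigger window
    # lands in the final 30 minutes of a ~6-month exposure period.
    minutes_in_six_months = 6 * 30 * 24 * 60   # 259,200 minutes
    p = 30 / minutes_in_six_months
    print(f"{p:.4%}")                          # ~0.0116%, i.e. roughly 1 in 8,640

So under a naive uniform model it's roughly a 1-in-8,600 coincidence, though in practice the trigger almost certainly wasn't uniformly distributed over time.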
Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.<p>My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.<p>Anyone else have corroborating anecdata?
This is my experience of the outage: my DNS servers stopped working, but HTTP was operational if I used the IP directly, so something is rotten with this report.<p>Lesson learned: I will switch to AWS in Asia and only use GCP in the central US, with GCP as backup in Asia and IONOS as backup in the central US.<p>Europe is a non-issue for hosting because it's where I live and services are plentiful.<p>I'm going to pay for a fixed IP on the fiber connection that offers one and host the primary DNS on my own hardware with lead-acid battery backup.<p>Enough of this external dependency crap!
This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.
> customers affected by the outage _may have_ encountered 404 errors<p>> for the inconvenience this service outage _may have_ caused<p>Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage of the requests were 404s (if not exactly how many) and for which customers. Why the weasel language? Own it.
Is there any possibility that data POSTed during that outage would have leaked some pretty sensitive data?<p>For example, I enter my credit card info on Etsy prior to the issue, and just as I hit send, the payload now gets sent to Google?<p>At that scale there have to be many examples of similar issues, no?
To me this shows Google hasn't put sufficient monitoring in place to know the <i>scale</i> of a problem and pick the correct scale of response.<p>For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, taking perhaps 15 minutes (on top of diagnosis and response times).<p>Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.<p>Obviously the latter puts a really large load on all the surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the part of the service likely responsible, 30 seconds to do a nuke-everything rollback).
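A rough sketch of what that kind of policy could look like, with made-up names, thresholds and timings (nothing here is from the report; it only illustrates picking the rollback strategy from the measured blast radius):

    # Hypothetical policy: pick a rollback strategy based on how much of the
    # fleet is actually failing. All names, thresholds and durations are
    # illustrative, not anything Google actually uses.
    from dataclasses import dataclass

    @dataclass
    class Strategy:
        name: str
        est_minutes: float  # expected time to complete the rollback

    ROLLING_RESTART = Strategy("urgent rolling restart", 15.0)       # gentle, low load on surrounding infra
    NUKE_EVERYTHING = Strategy("nuke-and-restart-everything", 0.5)   # brutal, must be load-tested in advance

    def choose_rollback(error_rate: float) -> Strategy:
        """Corner-case breakage gets the careful option; a near-total
        outage justifies the fastest possible rollback."""
        return NUKE_EVERYTHING if error_rate >= 0.5 else ROLLING_RESTART

    # 1% of users affected: the 15-minute rolling restart is the safer call.
    print(choose_rollback(0.01).name)
    # 100% outage: 10 min to identify the culprit + 0.5 min rollback = 10.5 min total.
    print(choose_rollback(1.0).name)

The hard part, of course, is trusting the error-rate signal and having load-tested the nuke path ahead of time.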