科技回声

12 comments

darkwater, more than 3 years ago

"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted." This reminds everyone that even the top-notch engineers who work at Google are still human. A bugfix that didn't really fix the bug is one of the more human things that can happen. I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, but still, I feel a bit better about myself today.

throwoutway, more than 3 years ago

Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery. I’m not good with statistics, but what are the chances?

htrp, more than 3 years ago

Did Roblox ever release the incident report from their outage?

chairmanwow1, more than 3 years ago

Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer. My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable. Anyone else have stabilizing anecdata?

bullen, more than 3 years ago

This is my experience of the outage: my DNS servers stopped working but HTTP was operational if I used the IP, so something is rotten with this report. Lesson learned: I will switch to AWS in Asia and only use GCP in central US, with GCP as backup in Asia and IONOS in central US. Europe is a non-issue for hosting because it's where I live and services are plentiful. I'm going to pay for a fixed IP on the fiber I can get that on and host the first DNS on my own hardware with lead-acid backup. Enough of this external dependency crap!

breakingcups, more than 3 years ago

What I would not give for a comprehensive leak of Google's major internal post-mortems.

gigatexal, more than 3 years ago

I find the post mortem really humanizing. As a customer of GCP there’s no love lost on my end.

londons_explore, more than 3 years ago

This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.

stevefan1999, more than 3 years ago

one bug fixed, two bugs introduced...

m0zg, more than 3 years ago

> customers affected by the outage _may have_ encountered 404 errors
> for the inconvenience this service outage _may have_ caused
Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage (if not how many exactly) of the requests were 404s and for which customers. Why the weasel language? Own it.

SteveNuts, more than 3 years ago

Is there any possibility that data POSTed during that outage would have leaked some pretty sensitive data? For example, I enter my credit card info on Etsy prior to the issue and just as I hit send the payload now gets sent to Google? At that scale there have to be many examples of similar issues, no?

londons_explore, more than 3 years ago

This to me shows Google hasn't put sufficient monitoring in place to know the scale of problems and the correct scale of response. For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, perhaps taking 15 minutes (on top of diagnosis and response times). Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds. Obviously the latter is a really large load on all surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25 minute outage down to just 10.5 minutes (10 minutes to identify the likely part of the service responsible, 30 seconds to do a nuke-everything rollback).
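
For anyone checking the numbers in the comment above, here is a rough sketch of the arithmetic. It assumes total outage time is simply diagnosis time plus recovery time, and it uses the 30-second rollback figure from the final sum (the comment also mentions 15 seconds). The constant and function names are made up for illustration, and none of these figures come from the incident report.

```python
# Illustrative figures from the comment above; not measured values.
DIAGNOSIS_MIN = 10.0        # time to identify the likely faulty component
ROLLING_RESTART_MIN = 15.0  # cautious rolling restart
FAST_ROLLBACK_MIN = 0.5     # "nuke-everything" rollback (30 seconds)


def outage_minutes(diagnosis_min: float, recovery_min: float) -> float:
    """Assumed model: total outage = time to diagnose + time to recover."""
    return diagnosis_min + recovery_min


slow = outage_minutes(DIAGNOSIS_MIN, ROLLING_RESTART_MIN)
fast = outage_minutes(DIAGNOSIS_MIN, FAST_ROLLBACK_MIN)
print(f"rolling restart: {slow} min, fast rollback: {fast} min")
```

Under this model the diagnosis time dominates once recovery is fast, so the remaining 10 minutes would have to come out of detection rather than rollback speed.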

Nov 16 GCP Load Balancing Incident Report
