"Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted."<p>This reminds everyone that even the top-notch engineers that work at Google are still humans. A bugfix that didn't really fix the bug is one of the more human things that can happen.
I surely make many more mistakes than the average Google engineer, and my overall output quality is lower, and yet I feel a bit better about myself today.
Strange that the race condition existed for 6 months, and yet manifested during the last 30 minutes of completing the patch to fix it, only four days after discovery.<p>I’m not good with statistics, but what are the chances?
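A back-of-the-envelope answer, under the (big) assumption that the trigger was equally likely to land in any 30-minute window across those 6 months:

    # Naive estimate: chance that a uniformly random 30-minute trigger window
    # lands in the final 30 minutes of a ~6-month exposure period.
    minutes_in_six_months = 6 * 30 * 24 * 60   # 259,200 minutes
    p = 30 / minutes_in_six_months
    print(f"{p:.4%}")                          # ~0.0116%, i.e. roughly 1 in 8,640

So under a naive uniform model it's roughly a 1-in-8,600 coincidence, though in practice the trigger almost certainly wasn't uniformly distributed over time.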
Not sure if this is my own personal bias, but I could have sworn this issue was affecting traffic for longer.<p>My company wasn’t affected, so I wasn’t paying close attention to it. I was surprised to read it was only ~90 min that services were unreachable.<p>Anyone else have corroborating anecdata?
This is my experience of the outage: my DNS servers stopped working, but HTTP was operational if I used the IP directly, so something is rotten with this report.<p>Lesson learned: I will switch to AWS in Asia and only use GCP in the central US, with GCP as backup in Asia and IONOS as backup in the central US.<p>Europe is a non-issue for hosting because it's where I live and services are plentiful.<p>I'm going to pay for a fixed IP on the fiber connection that offers one and host the primary DNS on my own hardware with lead-acid battery backup.<p>Enough of this external dependency crap!
This text has been rewritten for public consumption in quite a positive light... There are far more details and contributing factors, and only the best narrative will have been selected for publication here.
> customers affected by the outage _may have_ encountered 404 errors<p>> for the inconvenience this service outage _may have_ caused<p>Not a fan of this language, guys/gals. You've done a doo-doo, and you know exactly what percentage of the requests were 404s (if not exactly how many) and for which customers. Why the weasel language? Own it.
Is there any possibility that data POSTed during that outage would have leaked some pretty sensitive data?<p>For example, I enter my credit card info on Etsy prior to the issue, and just as I hit send, the payload now gets sent to Google?<p>At that scale there have to be many examples of similar issues, no?
To me this shows Google hasn't put sufficient monitoring in place to know the <i>scale</i> of a problem and pick the correct scale of response.<p>For example, if a service has an outage affecting 1% of users in some corner case, it perhaps makes sense to do an urgent rolling restart of the service, taking perhaps 15 minutes (on top of diagnosis and response times).<p>Whereas if there is a 100% outage, it makes sense to do an "insta-nuke-and-restart-everything", taking perhaps 15 seconds.<p>Obviously the latter puts a really large load on all the surrounding infrastructure, so it needs to be tested properly. But doing so can reduce a 25-minute outage down to just 10.5 minutes (10 minutes to identify the part of the service likely responsible, 30 seconds to do a nuke-everything rollback).
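A rough sketch of what that kind of policy could look like, with made-up names, thresholds and timings (nothing here is from the report; it only illustrates picking the rollback strategy from the measured blast radius):

    # Hypothetical policy: pick a rollback strategy based on how much of the
    # fleet is actually failing. All names, thresholds and durations are
    # illustrative, not anything Google actually uses.
    from dataclasses import dataclass

    @dataclass
    class Strategy:
        name: str
        est_minutes: float  # expected time to complete the rollback

    ROLLING_RESTART = Strategy("urgent rolling restart", 15.0)       # gentle, low load on surrounding infra
    NUKE_EVERYTHING = Strategy("nuke-and-restart-everything", 0.5)   # brutal, must be load-tested in advance

    def choose_rollback(error_rate: float) -> Strategy:
        """Corner-case breakage gets the careful option; a near-total
        outage justifies the fastest possible rollback."""
        return NUKE_EVERYTHING if error_rate >= 0.5 else ROLLING_RESTART

    # 1% of users affected: the 15-minute rolling restart is the safer call.
    print(choose_rollback(0.01).name)
    # 100% outage: 10 min to identify the culprit + 0.5 min rollback = 10.5 min total.
    print(choose_rollback(1.0).name)

The hard part, of course, is trusting the error-rate signal and having load-tested the nuke path ahead of time.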