This is very interesting. From the little I understand (sorry for using AWS terms, as I am more versed in AWS than GCE), this could happen to AWS as well, right? Even if your software is deployed to multiple AZs / multiple regions, if a bad routing / network configuration makes it through the various protection mechanisms, then basically no amount of redundancy can help if your service is part of the non-functional IP block. It seems that no matter how redundant you are, there will always be a single point of failure somewhere along the line; even if that point has multiple mechanisms to prevent it from failing, if all of those mechanisms fail, it's still a single point. What prevents this from happening at Azure / AWS? Is there anything that general internet routing protocols need to change to prevent it from happening?

e.g. I'm sure we will never hear that Bank of X transferred a billion dollars to an account but, because of propagation errors, published only the credit and never finished the debit, and now we have two billionaires. This two-phase (or more) commit is pretty much bulletproof in banking as far as I know, and banks are not known to be technologically more advanced than Google, so how come internet routing is so prone to errors that can make an entire cloud service unavailable for even a short period of time?
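To make the analogy concrete, here's roughly the kind of two-phase commit I have in mind (a toy Python sketch; the Account / prepare / commit names are made up and this is nothing like how a real bank implements it):

    # Toy two-phase commit for a transfer: both sides must vote yes in the
    # prepare phase before either side is told to commit, otherwise both abort.
    class Account:
        def __init__(self, balance):
            self.balance = balance
            self.pending = 0

        def prepare(self, delta):
            # Vote yes only if applying delta later cannot overdraw the account.
            if self.balance + delta < 0:
                return False
            self.pending = delta
            return True

        def commit(self):
            self.balance += self.pending
            self.pending = 0

        def abort(self):
            self.pending = 0

    def transfer(src, dst, amount):
        # Phase 1: both participants vote. Phase 2: commit only on unanimous yes.
        if src.prepare(-amount) and dst.prepare(amount):
            src.commit()
            dst.commit()
            return True
        src.abort()
        dst.abort()
        return False

The point being that a partial failure leaves both sides consistent (the credit never appears without the debit), which is the property I'd naively expect from route propagation too.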
I'm far from knowing much about networking (I took some graduate networking courses, but I still feel I know practically nothing about it...)
So I would appreciate it if someone versed in this could ELI5 whether this can happen on AWS and Azure regardless of how redundant you are (which leads to the notion of cross-cloud-provider redundancy, which I'm sure is used in some places), whether the banking analogy is fair and relevant, and whether there are any RFCs to make world-blackout routing nightmares less likely to happen.