I run my infrastructure on three different providers and use GeoIP-assigned anycast DNS servers from another provider.<p>Asia/Australia runs on DigitalOcean, Europe on OVH, and the Americas on AWS.<p>When someone requests the IP address of my site's front-end domain or static asset CDN domain, my nameserver determines their geographic location and returns the IP address of the resources closest to them.<p>I run health checks, so when S3 (which hosts my static assets for the Americas) went down, my nameservers stopped giving out the IP addresses of the Americas systems and started giving out those of the Europe systems.<p>When the health checks started passing again, everything restored itself.<p>Because the DNS TTLs are low, users in the Americas were impacted for only a few minutes, and only if their system had cached the IP.
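For anyone wondering what that failover logic looks like, here is a minimal sketch of the idea, not the actual configuration of any DNS provider. The region names, the fallback order, and the `resolve` function are all assumptions made up for illustration, and the IPs come from the reserved documentation ranges.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    ip: str
    healthy: bool = True  # updated by an external health checker (not shown)


# Hypothetical fallback order: nearest region first, then the others.
FALLBACKS = {
    "americas": ["americas", "europe", "asia"],
    "europe": ["europe", "americas", "asia"],
    "asia": ["asia", "europe", "americas"],
}


def resolve(client_region: str, regions: dict) -> str:
    """Return the IP of the first healthy region in the client's fallback order."""
    for name in FALLBACKS[client_region]:
        if regions[name].healthy:
            return regions[name].ip
    # Every region is failing its health check: answer with the home region
    # anyway rather than returning nothing at all.
    return regions[client_region].ip


regions = {
    "americas": Region("americas", "192.0.2.10"),
    "europe": Region("europe", "198.51.100.10"),
    "asia": Region("asia", "203.0.113.10"),
}

# Simulate the outage: the Americas health check starts failing.
regions["americas"].healthy = False
answer_during_outage = resolve("americas", regions)

# The health check passes again, so answers go back to normal on their own.
regions["americas"].healthy = True
answer_after_recovery = resolve("americas", regions)
```

The low TTL is what makes this useful in practice: the nameserver can change its answer instantly, but clients only notice once their cached record expires.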
We host a number of our customers' database systems in us-east-1.<p>What worked well for us (<a href="https://aiven.io" rel="nofollow">https://aiven.io</a>):<p>- Architecturally relying on only a few cloud provider services (we only need VMs, disks, and object storage)<p>- Upfront investment in being able to move services from one region to another without downtime<p>- Pre-existing tooling for easily (manually) reconfiguring backup destinations on the fly<p>- Not running everything on just AWS<p>What did not work so well:<p>- Backups should automatically reroute to a secondary backup site after N consecutive failures<p>- Alert spam; we need more aggregation<p>- New failure mode: extremely slow EBS access. Some affected VMs were still kind of working, but very slowly; we need to create a separate alert trigger for this
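The "reroute after N consecutive failures" item could be sketched roughly like this. This is only an illustration of the idea, not how Aiven implements it: `BackupRouter`, the destination names, and the `send` callable are all hypothetical.

```python
class BackupRouter:
    """Switch backups to a secondary site after N consecutive failures."""

    def __init__(self, primary: str, secondary: str, max_failures: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.max_failures = max_failures
        self.active = primary
        self.consecutive_failures = 0

    def upload(self, payload: bytes, send) -> bool:
        """Try one upload via send(destination, payload); return True on success."""
        try:
            send(self.active, payload)
        except Exception:
            # Any error counts as a failed attempt in this sketch.
            self.consecutive_failures += 1
            on_primary = self.active == self.primary
            if on_primary and self.consecutive_failures >= self.max_failures:
                # Too many failures in a row: reroute to the secondary site.
                self.active = self.secondary
                self.consecutive_failures = 0
            return False
        # A success resets the streak, so isolated errors never trigger a switch.
        self.consecutive_failures = 0
        return True


def flaky_send(destination: str, payload: bytes) -> None:
    # Stand-in for a real object storage client; the primary site is "down".
    if destination == "primary-site":
        raise IOError("object storage unavailable")


router = BackupRouter("primary-site", "secondary-site", max_failures=3)
results = [router.upload(b"backup-chunk", flaky_send) for _ in range(4)]
```

Note that this sketch never switches back to the primary by itself. Whether to fail back automatically or leave that as a manual step is a separate design decision.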