I’ve been running platform teams on AWS for 10 years, and working in AWS for 13. For anyone looking for guidance on how to avoid this, here’s the advice I give the startups I advise.<p>First, if you can, avoid us-east-1. Yes, you’ll miss new features, but it’s also the least stable region.<p>Second, go multi-AZ for production workloads. The safety of your customers’ data is your ethical responsibility. Protect it, back it up, and keep it as available as is reasonable.<p>Third, you’re gonna go down when the cloud goes down. Not much use getting bent out of shape over it. You can reduce your exposure by sticking to their core systems (EC2, S3, SQS, LBs, CloudFront, RDS, ElastiCache). The more systems you use, the less reliable things will be. However, running your own key-value store, API gateway, event bus, etc., can also be way less reliable than using theirs. So, realize it’s an operational trade-off.<p>Degradation of your app / platform is more likely to come from you than from AWS. You’re gonna roll out bad code, break your own infra, and overload your own system way more often than Amazon is gonna go down. If reliability matters to you, start by examining your own practices before reaching for things like multi-region or super durable, highly replicated systems.<p>This stuff is hard. It’s hard for Amazon engineers. Hard for platform folks at small and mega companies. It’s just hard. When your app goes down, and so does Disney Plus, take some solace that Disney, with all their buckets of cash, also couldn’t avoid the issue.<p>And, finally, hold cloud providers accountable. If they’re unstable and not providing the service you expect, leave. We’ve got tons of great options these days, especially if you don’t care about proprietary solutions.<p>Good luck y’all!