Hi everyone,<p>I have been working with AWS for many years, and over that time the service has been outstanding. A few months ago I started migrating (yet another business) to AWS, and we had an incident. It made me wonder whether AWS is starting to decline.<p>I would like to share the postmortem report with the community; please comment on what you think. I would like to know whether we made a fundamental mistake or whether AWS is actually degrading.<p>Times are UTC. Personal opinions have been removed from the report; only facts are stated.<p>----------
POSTMORTEM REPORT<p>The project consists of moving several services to AWS. The system consists of services in one auto scaling group and a PostgreSQL database in RDS.<p>- Sunday 4:30 am: we migrate the PostgreSQL database to RDS. RDS is configured with 200 GB of storage; the database size is 15 GB.<p>- Sunday 10:17 am: RDS detects that we are running out of space and decides to grow the storage from 200 GB to 999 GB. The RDS storage auto scaling event starts.<p>At this point the performance of the database is degraded. Alerts are triggered.<p>A test performed from the VPC network with the query "SELECT now()" took 20 seconds 248 milliseconds (a sketch of the probe is at the end of this post).<p>The database performance is so bad that many of the services go down.<p>- The RDS auto scaling event finishes at 14:39.<p>After contacting AWS Support (see details below) we decided to roll back.<p>We contacted AWS Support (Business). The most important points in the transcript are:<p>- The fact that RDS decided to grow the disk from 200 GB to 999 GB, when the actual database size is 15 GB, is not considered a problem.<p>- Performance degradation while the auto scaling event is running is expected. As "sessions"(1) are not being dropped, AWS considers the database online, so it is working as expected.<p>- We pointed out the example of a SELECT now() taking 20 seconds. This did not change their position that the database is online and all is good.<p>- When asked for an estimate of the duration of the event, they stated it could take "from several minutes to several days".<p>- The objective of AWS Support is to communicate what is happening (which implies you should not expect them to help you fix the actual problem).<p>-------<p>Opinions?
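For reference, here is a minimal sketch of the latency probe described above (the endpoint and credentials are placeholders, not our real values):
<pre><code>import time
import psycopg2

# Placeholder connection parameters; the real probe ran from inside the VPC.
conn = psycopg2.connect(
    host="mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com",
    dbname="postgres",
    user="app",
    password="...",
)
with conn.cursor() as cur:
    t0 = time.monotonic()
    cur.execute("SELECT now()")
    cur.fetchone()
    print(f"SELECT now() took {time.monotonic() - t0:.3f} s")
conn.close()
</code></pre>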
To me, this reads like they are trying to tell you to get lost in nice words. Most likely you're not spending $1M+ annually, so they just don't care about you or your experience.<p>AWS is built to support the biggest cloud empires on the planet. If you're not working at that scale, a smaller provider will likely give you much more personal attention: better support, better tuning, and potentially more bang for your buck.<p>But I wouldn't call AWS collapsing. It's operating as designed. The issue is that too many small companies bought into their "stand on the shoulders of giants" marketing and then convinced themselves that they need planet-scale whatever when really they don't. If you're small and nimble, you want small and nimble solutions, too.
It's very difficult to manage IOPS for RDS. IOPS scales with RDS disk size up to a certain point, so you might not have had nearly enough IOPS at only 200 GB. It's also possible that you used too small an instance type: you should probably use an instance that can keep all 15 GB in memory, so you only incur write IOPS from WAL writes, the bgwriter, and checkpointing.<p>Even a single NVMe SSD can greatly exceed the maximum IOPS available to RDS, so you have to be very careful migrating database workloads to RDS.<p>Once your workload exceeds the available throughput, latency will be terrible.<p>These are all hard lessons we've learned on our own. Support will not help you if you have terrible performance issues with RDS. They love to tell you to try optimizing your queries. AWS could afford to staff multiple dedicated support positions for our account, but they pocket the money instead and give terrible canned responses. If you have a tiny account, they definitely won't help you. Some people say support is great, but I've never had a good experience.
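If the storage was gp2 (an assumption on my part, the post doesn't say), baseline IOPS is a direct function of allocated size, which would explain why 200 GB leaves very little headroom. A rough sketch of the arithmetic:
<pre><code>def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 baseline (assumption: gp2 storage): 3 IOPS per GiB,
    # floor of 100, cap of 16,000. gp3/io1 provision IOPS differently.
    return max(100, min(3 * size_gib, 16_000))

for size in (200, 999):
    base = gp2_baseline_iops(size)
    burst = " (burstable to 3,000 while credits last)" if base < 3_000 else ""
    print(f"{size} GiB -> {base} baseline IOPS{burst}")
</code></pre>
So at 200 GB you get roughly 600 baseline IOPS and rely on the burst bucket; once that drains, latency falls off a cliff.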
Something definitely seems off here, from the fact that RDS chose to scale to the response you got from support (who've always been... mostly OK in my experience).<p>First of all, I'd like a little more context before I jump to any conclusions:<p>- Does CloudWatch confirm that RDS "needed" to scale? And if it does, do any other metrics increase at the same time as the storage being used? (There's a sketch for pulling the storage metric at the end of this comment.)<p>- Other than the (presumably) one change made to that RDS instance at ~4:30 AM, were there any other changes made to it, specifically to its storage, prior to the autoscaling event?<p>- Had this service been tested on RDS prior to the migration being performed?<p>- Were any other changes made that might affect your DB? For example, a query being changed or something of that nature.<p>To me, it sounds like support found something that suggested whatever was happening to RDS at the time was "someone else's problem" under their shared responsibility model. Whether or not that's true, who knows, but from how you've described it they definitely seem to be trying to palm you off. It's worth mentioning to your TAM/rep if you have one, because this is pretty poor service.
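On the first question, a minimal sketch of how I'd check whether free storage was actually shrinking before the event (the instance identifier and region are placeholders; the namespace and metric name are the standard RDS ones):
<pre><code>import datetime
import boto3

cw = boto3.client("cloudwatch", region_name="eu-west-1")  # placeholder region
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=12)

resp = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",  # reported in bytes
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Minimum"],
)
for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(dp["Timestamp"], round(dp["Minimum"] / 1e9, 1), "GB free")
</code></pre>
If that graph never comes close to zero, the autoscaling trigger itself is worth pushing back on.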
is this RDS Aurora or RDS "classic"?<p>classic RDS is essentially control-plane-only - AWS spins up an EC2 instance on your behalf, installs the database software, configures replication, etc etc. but on the data plane, you're connecting to a more or less stock Postgres or MySQL instance.<p>Aurora uses a more modern distributed design [0] akin to Spanner, CockroachDB, etc. they implemented their own quorum-based log layer as a storage backend, so it is involved in both the control plane and the data plane.<p>classic RDS has been around long enough, and has been in essentially maintenance mode without many new features added, that I wouldn't expect weird behavior like this from it. so I'm guessing Aurora.<p>YMMV, but personally I don't trust Aurora for databases I administrate. I use classic RDS for small stuff, but if I needed availability & throughput beyond what classic RDS with its single-node / scale-up-only model can offer, I would reach for CockroachDB or Cassandra or something similar.<p>also, if your database size is only 15 GB you definitely don't need (and are overpaying for) a 200 GB Aurora instance. your data fits in RAM on a mid-sized RDS instance.<p>0: <a href="https://www.allthingsdistributed.com/2019/03/amazon-aurora-design-cloud-native-relational-database.html" rel="nofollow">https://www.allthingsdistributed.com/2019/03/amazon-aurora-d...</a>
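a quick way to sanity-check the "fits in RAM" point against whatever instance class you're running (a sketch; the connection parameters are placeholders):
<pre><code>import psycopg2

# Placeholder endpoint and credentials
conn = psycopg2.connect(host="my-db.xxxxxxxx.eu-west-1.rds.amazonaws.com",
                        dbname="app", user="app", password="...")
with conn.cursor() as cur:
    cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
    print("database size:", cur.fetchone()[0])
    # shared_buffers hints at how much Postgres can keep hot
    # on the current instance class
    cur.execute("SHOW shared_buffers")
    print("shared_buffers:", cur.fetchone()[0])
conn.close()
</code></pre>
if the database comfortably fits under the instance's RAM, reads mostly come from the buffer/page cache and disk IOPS stops being the bottleneck for them.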
Decline is the wrong word.<p>It is getting too big for its own good. It's collapsing under its own weight.<p>The time is ripe for a smaller, sharper, leaner startup to do things differently and better.
You don't mention the type of disks you're using with RDS.
Disk scaling/changing operations also consume IOPS to do their work, which can get you into trouble if you, say, run out of burstable IOPS while RDS is still partway through a background mirror operation.<p>If you have RDS Enhanced Monitoring turned on, do look at those metrics. They're a bit trickier to use, but because they come from an agent on the instance rather than the hypervisor, they can sometimes help with debugging (there's a sketch for pulling the raw data at the end of this comment).<p>Like any company, sometimes the support agent you get might be having an off day. This is where having an AWS TAM is helpful - you can get them to re-escalate the support issue for a second opinion.
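A rough sketch of pulling the raw Enhanced Monitoring data, assuming it's enabled and published (as documented) to the RDSOSMetrics CloudWatch Logs group with one stream per DbiResourceId; the instance identifier and region are placeholders:
<pre><code>import json
import boto3

rds = boto3.client("rds", region_name="eu-west-1")   # placeholder region
logs = boto3.client("logs", region_name="eu-west-1")

# Placeholder instance identifier
inst = rds.describe_db_instances(DBInstanceIdentifier="my-db")["DBInstances"][0]
stream = inst["DbiResourceId"]  # Enhanced Monitoring streams are named after this

events = logs.get_log_events(logGroupName="RDSOSMetrics",
                             logStreamName=stream, limit=5)
for e in events["events"]:
    m = json.loads(e["message"])
    # Exact fields vary by engine/agent version; print whichever are present
    print(m.get("timestamp"),
          {k: m[k] for k in ("loadAverageMinute", "diskIO") if k in m})
</code></pre>
The OS-level disk and CPU numbers here are what would tell you whether the instance was genuinely starved during the scaling event, rather than what the hypervisor-level CloudWatch metrics suggest.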