Before, they were exposed to Route 53 outages, CloudFront outages, and S3 outages. Now they can add Lambda outages to that list too.<p>It's also unclear how this actually solves the problem. Now if S3 in _either_ region is unavailable, they'll start failing 50% of uncached requests. I'm guessing they're using Route 53 health checks with some CloudWatch alarm to cut over to one region when they think the other is unhealthy. Presumably this is covered in the currently unavailable Part 2.<p>I'm mildly skeptical that this is worth the increased risk plus the increased cost of running Lambda@Edge on cache misses.
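To put a rough number on the 50% claim, here is a back-of-the-envelope sketch. It assumes cache misses are spread evenly across the regional buckets and that nothing fails over; the function name and the cache hit ratios are illustrative assumptions, not figures from the article.

```python
def failed_fraction(cache_hit_ratio: float, regions: int = 2, regions_down: int = 1) -> float:
    """Fraction of ALL requests that fail when cache misses are split
    evenly across `regions` buckets and `regions_down` of them are
    unavailable, with no failover in place (assumed model)."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    if regions < 1 or not 0 <= regions_down <= regions:
        raise ValueError("regions_down must be between 0 and regions")
    miss_ratio = 1.0 - cache_hit_ratio
    # Only cache misses reach S3; of those, regions_down/regions hit a dead bucket.
    return miss_ratio * regions_down / regions


if __name__ == "__main__":
    for hit_ratio in (0.0, 0.5, 0.9):
        print(f"hit ratio {hit_ratio:.0%}: {failed_fraction(hit_ratio):.1%} of requests fail")
```

Under this model, the share of uncached requests that fail stays at one half no matter how good the cache is; a higher hit ratio only shrinks how much of total traffic that half represents.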
If the "Cross-region replication" line in the picture is talking about the native S3 cross-region replication (as I assume it is), beware the replication latency in this setup. AWS recently released "replication with an SLA" for S3 [0], but at "99.99% of the objects will be replicated within 15 minutes", it's not a good enough SLA to rely on in setups like this.<p>Presumably Part 2 of this post will address this limitation, or maybe their product isn't affected by it. (I've never looked into Contentful, though maybe I will now -- blog post purpose achieved?)<p>I'm also not sure if "active-active" is the best name for this setup, since objects can't be written to the 2nd bucket (replication only goes one direction). Generally I associate "active-active" with "writes can happen anywhere", though maybe I'm wrong?<p>[0] <a href="https://aws.amazon.com/blogs/aws/s3-replication-update-replication-sla-metrics-and-events/" rel="nofollow">https://aws.amazon.com/blogs/aws/s3-replication-update-repli...</a>
Confused: why not use CloudFront Origin Groups? <a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/high_availability_origin_failover.html" rel="nofollow">https://docs.aws.amazon.com/AmazonCloudFront/latest/Develope...</a><p>Full disclosure: I've never used it, but I'm pretty sure this feature was created for the scenario you are trying to solve.
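For reference, a minimal sketch of what an origin group might look like as the `OriginGroups` part of a CloudFront distribution config. The shape follows my understanding of the CloudFront API structure, so check it against the linked docs; the group ID, origin IDs, and the helper function itself are hypothetical.

```python
def build_origin_groups(group_id, primary_origin_id, secondary_origin_id,
                        failover_status_codes=(500, 502, 503, 504)):
    """Build an `OriginGroups` fragment with one primary and one
    secondary origin. All IDs passed in are placeholders (assumptions)."""
    codes = list(failover_status_codes)
    return {
        "Quantity": 1,
        "Items": [
            {
                "Id": group_id,
                # CloudFront retries against the secondary origin when the
                # primary responds with one of these status codes.
                "FailoverCriteria": {
                    "StatusCodes": {"Quantity": len(codes), "Items": codes}
                },
                # Order matters: the first member is the primary origin.
                "Members": {
                    "Quantity": 2,
                    "Items": [
                        {"OriginId": primary_origin_id},
                        {"OriginId": secondary_origin_id},
                    ],
                },
            }
        ],
    }


if __name__ == "__main__":
    import json
    # Hypothetical origin IDs for two regional S3 buckets.
    print(json.dumps(build_origin_groups("s3-failover", "s3-us-east-1", "s3-eu-west-1"), indent=2))
```

The cache behavior would then target the group ID instead of a single origin. Note that this is failover (primary first, secondary on error), not the 50/50 split the article describes.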
Although it may make sense for this company, for the _majority_ of companies this would be over-engineering. S3 availability is some of the best in the business. If S3 is down, a good chunk of the internet is down with it.
The architecture described here is pretty simple. The article states the fix was 20 lines of code. If this is the hardest problem you have to solve at work, I envy you.