Just wanted to add a quick note before we get the usual deluge of "you should be running in multiple AZs and regions" posts: These outages are relatively rare, and your best decision might just be to accept the tiny amount of downtime and keep your app simple and inexpensive to run.<p>I of course don't know the tradeoffs involved in running your system, but in a lot of my situations the simplicity of a single AZ with a straightforward failover option is the right tradeoff.
<a href="https://status.heroku.com/incidents/1892" rel="nofollow">https://status.heroku.com/incidents/1892</a> - it appears Heroku is being particularly affected. We've had multiple sites on multiple accounts go down in the past few minutes.<p>EDIT T16:31Z: It appears Heroku has failed over their dashboard, but dynos are still failing to come online. We had assumed that they had multi-region failovers for their customers. Incredibly disappointing.
Looks to have been caused by a loss of utility power and subsequent backup generator failure at one datacenter.<p>> 10:47 AM PDT We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most instances but still have 1.5% of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue.<p><a href="https://status.aws.amazon.com/rss/ec2-us-east-1.rss" rel="nofollow">https://status.aws.amazon.com/rss/ec2-us-east-1.rss</a>
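For anyone who would rather watch that feed programmatically than keep refreshing the status page, here's a minimal sketch that polls the linked RSS feed using only the Python standard library. The feed URL is the one above; everything else is illustrative.

```python
# Minimal sketch: poll the EC2 us-east-1 status RSS feed and print recent items.
# Standard library only; the feed URL is the one linked in the parent comment.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"

def fetch_status_items(url=FEED_URL):
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # Standard RSS 2.0 layout: <rss><channel><item> with <title>/<pubDate> children.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        published = item.findtext("pubDate", default="")
        yield published, title

if __name__ == "__main__":
    for published, title in fetch_status_items():
        print(f"{published}  {title}")
```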
I got paged 50 minutes before AWS updated their status page. We are running on AWS's managed Kubernetes offering (EKS), and about one third of our nodes were running in the affected availability zone. We were then able to move all of our traffic out of that AZ, which solved our issues. The main symptom was HTTP requests made by our backend to third-party APIs failing, but only on requests originating from that AZ.
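For context on what "moving traffic out of that AZ" can look like on EKS, here is a hedged sketch (not necessarily what the poster did): nodes carry the standard topology.kubernetes.io/zone label, so you can cordon everything in the impaired zone with the official kubernetes Python client. The zone name below is an assumption.

```python
# Hedged sketch: cordon every node in the affected AZ so new pods schedule elsewhere.
# Existing pods would still need to be drained/evicted as a follow-up step.
from kubernetes import client, config

AFFECTED_ZONE = "us-east-1a"  # placeholder; substitute the zone that is actually impaired

def cordon_zone(zone=AFFECTED_ZONE):
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    selector = f"topology.kubernetes.io/zone={zone}"
    for node in v1.list_node(label_selector=selector).items:
        # Marking the node unschedulable is the API equivalent of `kubectl cordon`.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name}")

if __name__ == "__main__":
    cordon_zone()
```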
Amazon JUST had an EC2/RDS failure in one AZ in Tokyo last week; the cause was a bug in their HVAC that led to overheating. I wonder if this is similar or just coincidental.<p><a href="https://aws.amazon.com/jp/message/56489/" rel="nofollow">https://aws.amazon.com/jp/message/56489/</a>
The Spinnaker project is looking more appealing with every outage. Outage detected in X provider in Y region? Deploy infrastructure to Z provider in Y region.
us-east-1 continues to have worse uptime than other regions (likely for good reason: it's still the default region).<p>I've avoided that region, and I can't remember the last time I had downtime caused by Amazon.
Leaseweb Virginia is having a major outage as well. Maybe it is related?<p><a href="https://www.leasewebstatus.com/incidents/updated-connectivity-issues-in-part-of-our-network/ci25t2jr" rel="nofollow">https://www.leasewebstatus.com/incidents/updated-connectivit...</a>
This seems to affect a broad swath of the internet, perhaps because the us-east-1 region is so popular? My side project StatusGator shows approximately 15% of the status pages we monitor (including our own) with a warn or down notice right now, a sizable spike over the baseline.
>We are investigating connectivity issues affecting some instances in a single Availability Zone in the US-EAST-1 Region.<p>Well there’s your problem, people. Use multiple AZs.
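For what it's worth, here's a minimal sketch of what "use multiple AZs" looks like in practice: an Auto Scaling group whose subnets span several zones, created with boto3. Every name and ID below is a placeholder.

```python
# Illustrative only: an Auto Scaling group spread across subnets in three AZs,
# so losing one zone still leaves capacity in the other two.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-multi-az",          # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",     # hypothetical launch template
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    # Comma-separated subnet IDs, each in a different Availability Zone.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```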
This is a pretty good common-sense post on not having your failure modes correlate with your client's failure modes.<p><a href="https://trackjs.com/blog/separate-monitoring/" rel="nofollow">https://trackjs.com/blog/separate-monitoring/</a><p>I don't work for any of the entities mentioned.
For folks here, my RDS instances in us-east-1f are doing okay (knock on wood!) Not sure which AZ is suffering most.<p>My client's Heroku instances are online, thankfully.<p>Can anyone here speak to their experience with the Ohio region? I'm considering leaning on that more and more.
Is there no way at all to reach Amazon EC2 instances in us-east-1, or is it just the default route to the internet that's broken?<p>Is there any way for the owners of the instances to reach them?
I'm in Australia and Reddit/Twitter ground to a standstill - request timeout after request timeout. I presumed it was an outage somewhere, but was surprised to learn it was AWS us-east-1. I would have thought that my connection would have hit a different region based on my location.
My little instance died and I had to bring it back from the image.<p>Glad to know that it wasn't anything personal over any Hacker News gags I've done.
Well, this outage says something about the companies that religiously depend on it.<p>If your entire service just went down as soon as this happened, congratulations! You didn't deploy in multiple regions or think about a failsafe/fallback option that redirects traffic away from the affected service or instance.
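One common shape for that kind of fallback, sketched here with boto3: Route 53 DNS failover records, where a failing health check on the primary shifts traffic to a standby in another region. The zone ID, domain, IPs, and health check ID are all placeholders, not real values.

```python
# Hedged sketch of DNS-level failover: a PRIMARY record tied to a health check,
# and a SECONDARY record in another region that Route 53 serves if the check fails.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",            # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],   # placeholder IP
                    "HealthCheckId": "00000000-0000-0000-0000-000000000000",  # placeholder
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],  # placeholder IP
                },
            },
        ]
    },
)
```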