Just wanted to add a quick note before we get the usual deluge of "you should be running in multiple AZs and regions" posts: These outages are relatively rare, and your best decision might just be to accept the tiny amount of downtime and keep your app simple and inexpensive to run.<p>I of course don't know the tradeoffs involved in running your system, but in a lot of my situations the simplicity of a single AZ with a straightforward failover option is the right tradeoff.
<a href="https://status.heroku.com/incidents/1892" rel="nofollow">https://status.heroku.com/incidents/1892</a> - it appears Heroku is being particularly affected. We've had multiple sites on multiple accounts go down in the past few minutes.<p>EDIT T16:31Z: It appears Heroku has failed over their dashboard, but dynos are still failing to come online. We had assumed that they had multi-region failovers for their customers. Incredibly disappointing.
Looks to have been caused by a loss of utility power and subsequent backup generator failure at one datacenter.<p>> 10:47 AM PDT We want to give you more information on progress at this point, and what we know about the event. At 4:33 AM PDT one of 10 datacenters in one of the 6 Availability Zones in the US-EAST-1 Region saw a failure of utility power. Backup generators came online immediately, but for reasons we are still investigating, began quickly failing at around 6:00 AM PDT. This resulted in 7.5% of all instances in that Availability Zone failing by 6:10 AM PDT. Over the last few hours we have recovered most instances but still have 1.5% of the instances in that Availability Zone remaining to be recovered. Similar impact existed to EBS and we continue to recover volumes within EBS. New instance launches in this zone continue to work without issue.<p><a href="https://status.aws.amazon.com/rss/ec2-us-east-1.rss" rel="nofollow">https://status.aws.amazon.com/rss/ec2-us-east-1.rss</a>
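For anyone who would rather watch that feed programmatically than keep refreshing the status page, here's a minimal sketch that polls the linked RSS feed using only the Python standard library. The feed URL is the one above; everything else is illustrative.

```python
# Minimal sketch: poll the EC2 us-east-1 status RSS feed and print recent items.
# Standard library only; the feed URL is the one linked in the parent comment.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://status.aws.amazon.com/rss/ec2-us-east-1.rss"

def fetch_status_items(url=FEED_URL):
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    # Standard RSS 2.0 layout: <rss><channel><item> with <title>/<pubDate> children.
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        published = item.findtext("pubDate", default="")
        yield published, title

if __name__ == "__main__":
    for published, title in fetch_status_items():
        print(f"{published}  {title}")
```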
I got paged 50 minutes before AWS updated their status page. We are running on AWS's managed Kubernetes offering (EKS), and about one third of our nodes were running in the affected availability zone. We were then able to move all of our traffic out of that AZ, which solved our issues. The main symptom was HTTP requests made by our backend to third-party APIs failing, but only on requests originating from that AZ.
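For context on what "moving traffic out of that AZ" can look like on EKS, here is a hedged sketch (not necessarily what the poster did): nodes carry the standard topology.kubernetes.io/zone label, so you can cordon everything in the impaired zone with the official kubernetes Python client. The zone name below is an assumption.

```python
# Hedged sketch: cordon every node in the affected AZ so new pods schedule elsewhere.
# Existing pods would still need to be drained/evicted as a follow-up step.
from kubernetes import client, config

AFFECTED_ZONE = "us-east-1a"  # placeholder; substitute the zone that is actually impaired

def cordon_zone(zone=AFFECTED_ZONE):
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    selector = f"topology.kubernetes.io/zone={zone}"
    for node in v1.list_node(label_selector=selector).items:
        # Marking the node unschedulable is the API equivalent of `kubectl cordon`.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name}")

if __name__ == "__main__":
    cordon_zone()
```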
Amazon JUST had an EC2/RDS failure in one AZ in Tokyo last week; the cause was a bug in their HVAC that led to overheating. I wonder if this is similar or just coincidental.<p><a href="https://aws.amazon.com/jp/message/56489/" rel="nofollow">https://aws.amazon.com/jp/message/56489/</a>
The Spinnaker project is looking more appealing with every outage. Outage detected in X provider in Y region? Deploy infrastructure to Z provider in Y region.
us-east-1 continues to have worse uptime than other regions (likely for good reason: it's still the default region).<p>I've avoided that region, and I can't remember the last time I had downtime caused by Amazon.
Leaseweb Virginia is having a major outage as well. Maybe it is related?<p><a href="https://www.leasewebstatus.com/incidents/updated-connectivity-issues-in-part-of-our-network/ci25t2jr" rel="nofollow">https://www.leasewebstatus.com/incidents/updated-connectivit...</a>
This seems to affect a broad swath of the internet, perhaps because the us-east-1 region is so popular? My side project StatusGator shows approximately 15% of the status pages we monitor (including our own) with a warn or down notice right now, a sizable spike over the baseline.
>We are investigating connectivity issues affecting some instances in a single Availability Zone in the US-EAST-1 Region.<p>Well there’s your problem, people. Use multiple AZs.
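For what it's worth, here's a minimal sketch of what "use multiple AZs" looks like in practice: an Auto Scaling group whose subnets span several zones, created with boto3. Every name and ID below is a placeholder.

```python
# Illustrative only: an Auto Scaling group spread across subnets in three AZs,
# so losing one zone still leaves capacity in the other two.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-multi-az",          # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateName": "web-template",     # hypothetical launch template
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    # Comma-separated subnet IDs, each in a different Availability Zone.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
)
```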
This is a pretty good common-sense post on not having your failure modes correlate with your client's failure modes.<p><a href="https://trackjs.com/blog/separate-monitoring/" rel="nofollow">https://trackjs.com/blog/separate-monitoring/</a><p>I don't work for any of the entities mentioned.
For folks here, my RDS instances in us-east-1f are doing okay (knock on wood!) Not sure which AZ is suffering most.<p>My client's Heroku instances are online, thankfully.<p>Can anyone here speak to their experience with the Ohio region? I'm considering leaning on that more and more.
Is there no way at all to reach Amazon EC2 instances in us-east-1, or is it just the default route to the internet that's broken?<p>Is there any way for the owners of the instances to reach them?
I'm in Australia and Reddit/Twitter ground to a standstill - request timeout after request timeout. I presumed it was an outage somewhere, but was surprised to learn it was AWS us-east-1. I would have thought that my connection would have hit a different region based on my location.
My little instance died and I had to bring it back from the image.<p>Glad to know that it wasn't anything personal over any Hacker News gags I've done.
Well, this outage says something about the companies that religiously depend on it.<p>If your entire service just went down as soon as this happened, congratulations! You didn't deploy in multiple regions or think about a failsafe/fallback option that redirects traffic away from the affected service or instance.
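One common shape for that kind of fallback, sketched here with boto3: Route 53 DNS failover records, where a failing health check on the primary shifts traffic to a standby in another region. The zone ID, domain, IPs, and health check ID are all placeholders, not real values.

```python
# Hedged sketch of DNS-level failover: a PRIMARY record tied to a health check,
# and a SECONDARY record in another region that Route 53 serves if the check fails.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000EXAMPLE",            # placeholder hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.10"}],   # placeholder IP
                    "HealthCheckId": "00000000-0000-0000-0000-000000000000",  # placeholder
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.20"}],  # placeholder IP
                },
            },
        ]
    },
)
```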