
Summary of the AWS Service Event in the Northern Virginia (US-East-1) Region

638 points by eigen-vector, over 3 years ago

49 comments

jetru, over 3 years ago
Complex systems are really, really hard. I'm not a big fan of seeing all these folks bash AWS for this without really understanding the complexity or nastiness of situations like this. Running the kind of services they do for the kind of customers they have, this is a VERY hard problem.

We ran into a very similar issue, but at the database layer, in our company literally 2 weeks ago, where connections to our MySQL exploded and completely took down our data tier and caused a multi-hour outage, compounded by retries and thundering herds. Understanding this problem under a stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky.

Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and it's difficult to model and predict, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when integrating, things get out of whack. I can't stress how difficult it is to model or even think about these systems; they're very, very hard. Combined with this knowledge being distributed among many people, you're dealing with not only distributed systems but also distributed people, which adds more difficulty in wrapping this around your head.

Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.

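A minimal sketch of one way to keep this kind of retry storm from multiplying load on an already-struggling data tier: cap concurrent connection attempts and back off with jitter. This is illustrative only; the open_connection() helper is hypothetical and stands in for whatever actually dials the database.

    import asyncio
    import random

    # Hypothetical helper: stands in for whatever actually opens a database
    # connection in a real client.
    async def open_connection():
        ...

    # Cap concurrent connection attempts so a retry storm cannot multiply the
    # load on a data tier that is already struggling.
    CONNECT_SEMAPHORE = asyncio.Semaphore(20)

    async def connect_with_backoff(max_attempts: int = 6):
        delay = 0.1
        for _ in range(max_attempts):
            async with CONNECT_SEMAPHORE:              # limit the herd
                try:
                    return await open_connection()
                except OSError:
                    pass
            # Full jitter spreads retries out instead of synchronizing them.
            await asyncio.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 10.0)               # capped exponential backoff
        raise ConnectionError(f"gave up after {max_attempts} attempts")
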
azundo, over 3 years ago
> This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.

I remember my first experience realizing the client retry logic we had implemented was making our lives way worse. Not sure if it's heartening or disheartening that this was part of the issue here.

Our mistake was resetting the exponential backoff delay whenever a client successfully connected and received a response. At the time a percentage (but not all) of responses were degraded and extremely slow, and the request that checked the connection was not. So a client would time out, retry for a while, backing off exponentially, eventually successfully reconnect, and then, after a subsequent failure, start aggressively trying again. System dynamics are hard.

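For illustration (not azundo's actual code), a sketch of that failure mode and its fix: only shrink the backoff delay once real, end-to-end requests are healthy, not whenever a lightweight connectivity check happens to succeed. The request callable and connectivity_check_ok() are hypothetical names.

    import random
    import time

    BASE_DELAY = 0.5
    MAX_DELAY = 60.0

    def call_with_backoff(request, max_attempts=10):
        """Retry `request` (a hypothetical callable) with capped, jittered
        exponential backoff."""
        delay = BASE_DELAY
        for _ in range(max_attempts):
            try:
                return request()
            except TimeoutError:
                time.sleep(random.uniform(0, delay))   # jittered wait
                delay = min(delay * 2, MAX_DELAY)
                # The bug described above, simplified: resetting the delay here
                # whenever a cheap connectivity check succeeded meant clients
                # came back at full aggression after every transient success.
                # Only shrink the delay once real requests are healthy again.
                # if connectivity_check_ok():
                #     delay = BASE_DELAY
        raise TimeoutError("service still degraded after retries")
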
DenisM, over 3 years ago
> Customers accessing Amazon S3 and DynamoDB were not impacted by this event.

We've seen plenty of S3 errors during that period. Kind of undermines the credibility of this report.

tyingq, over 3 years ago
"Amazon Secure Token Service (STS) experienced elevated latencies"

I was getting 503 "service unavailable" from STS during the outage most of the time I tried calling it.

I guess by "elevated latency" they mean as seen from anyone with retry logic that would keep trying after many consecutive attempts?

divbzero, over 3 years ago
> This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.

Disruption of the standard incident response mechanism seems to be a common element of longer-lasting incidents.

mperham, over 3 years ago
I wish it contained actual detail and wasn’t couched in generalities.
propter_hoc, over 3 years ago
Does anyone know how often an AZ experiences an issue as compared to an entire region? AWS sells the redundancy of AZs pretty heavily, but it seems like a lot of the issues that happen end up being region-wide. I'm struggling to understand whether I should be replicating our service across regions or whether the AZ redundancy within a region is sufficient.

wjossey, over 3 years ago
I've been running platform teams on AWS for 10 years now, and working in AWS for 13. For anyone looking for guidance on how to avoid this, here's the advice I give the startups I advise.

First, if you can, avoid us-east-1. Yes, you'll miss new features, but it's also the least stable region.

Second, go multi-AZ for production workloads. Safety of your customers' data is your ethical responsibility. Protect it, back it up, keep it as generally available as is reasonable.

Third, you're gonna go down when the cloud goes down. Not much use getting overly bent out of shape. You can reduce your exposure by just using their core systems (EC2, S3, SQS, LBs, CloudFront, RDS, ElastiCache). The more systems you use, the less reliable things will be. However, running your own key-value store, API gateway, event bus, etc., can also be way less reliable than using theirs. So, realize it's an operational trade-off.

Degradation of your app / platform is more likely to come from you than from AWS. You're gonna roll out bad code, break your own infra, and overload your own system way more often than Amazon is gonna go down. If reliability matters to you, start by examining your own practices first before thinking about things like multi-region or super-durable, highly replicated systems.

This stuff is hard. It's hard for Amazon engineers. Hard for platform folks at small and mega companies. It's just hard. When your app goes down, and so does Disney Plus, take some solace that Disney, with all their buckets of cash, also couldn't avoid the issue.

And, finally, hold cloud providers accountable. If they're unstable and not providing the service you expect, leave. We've got tons of great options these days, especially if you don't care about proprietary solutions.

Good luck y'all!

almostdeadguy, over 3 years ago
> The AWS container services, including Fargate, ECS and EKS, experienced increased API error rates and latencies during the event. While existing container instances (tasks or pods) continued to operate normally during the event, if a container instance was terminated or experienced a failure, it could not be restarted because of the impact to the EC2 control plane APIs described above.

This seems pretty obviously false to me. My company has several EKS clusters in us-east-1 with most of our workloads running on Fargate. All of our Fargate pods were killed and were unable to be restarted during this event.

bamboozled, over 3 years ago
Still doesn't explain the cause of all the IAM permission-denied errors we saw against policies that are again working fine without any intervention.

Obviously networking issues can cause any number of symptoms, but it seems like an unusual detail to leave out. Unless it was another outage happening at the same time.

Ensorceled, over 3 years ago
There are a lot of comments in here that boil down to "could you do infrastructure better?"

No, absolutely not. That's why I'm on AWS.

But what we are all ACTUALLY complaining about is the ongoing lack of transparent and honest communication during outages and, clearly, in their postmortems.

Honest communication? Yeah, I'm pretty sure I could do that much better than AWS.

raffraffraff, over 3 years ago
Something they didn't mention is AWS billing alarms. These rely on metrics systems which were affected by this (and are missing some data). Crucially, billing alarms only exist in the us-east-1 region, so if you're using them, you're impacted no matter where your infrastructure is deployed. (That's just my reading of it.)

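For context on why that is: the EstimatedCharges metric that billing alarms watch is published only to us-east-1, so the CloudWatch client has to be pinned to that region regardless of where the rest of the infrastructure runs. A rough boto3 sketch; the alarm name and threshold are placeholders, and it assumes billing alerts are enabled on the account.

    import boto3

    # Billing metrics and alarms live only in us-east-1, so the client must be
    # pinned there even if all other infrastructure is in another region.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-spend-over-500-usd",      # placeholder name
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                                # evaluate every 6 hours
        EvaluationPeriods=1,
        Threshold=500.0,                             # placeholder threshold
        ComparisonOperator="GreaterThanThreshold",
    )
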
simlevesque, over 3 years ago
> Customers accessing Amazon S3 and DynamoDB were not impacted by this event. However, access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints was impaired during this event.

What does this even mean? I bet most people use DynamoDB via a VPC, in a Lambda or in EC2.

xyst, over 3 years ago
I am not a fan of AWS due to their substantial market share in cloud computing. But as a software engineer I do appreciate their ability to provide fast turnarounds on root cause analyses and make them public.

herodoturtle, over 3 years ago
I am grateful to AWS for this report.

Not sure if any AWS support staff are monitoring this thread, but the article said:

> Customers also experienced login failures to the AWS Console in the impacted region during the event.

All our AWS instances / resources are in EU/UK availability zones, and yet we couldn't access our console either.

Thankfully none of our instances were affected by the outage, but our inability to access the console was quite worrying.

Any idea why this was the case?

Any suggestions to mitigate this risk in the event of a future outage would be appreciated.

llaolleh, over 3 years ago
I wonder if they could've designed better circuit breakers for situations like this. They're very common in electrical engineering, but I don't think they're as common in software design. Something we should try to design and put in, actually, for situations like this.

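A minimal circuit-breaker sketch of the idea (not any particular library's API): after enough consecutive failures the breaker opens and callers fail fast instead of piling more retries onto a congested dependency; after a cooldown, a single probe call is allowed through.

    import time

    class CircuitBreaker:
        """Fail fast after repeated errors instead of adding to the congestion."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    raise RuntimeError("circuit open: failing fast")
                # Half-open: let one probe through after the cooldown.
                self.opened_at = None
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                raise
            self.failures = 0  # a success closes the circuit again
            return result
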
wly_cdgr, over 3 years ago
Their service board is always as green as you have to be to trust it
markus_zhang, over 3 years ago
> At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.

Just curious, is this scaling an AWS job or a client job? Looks like an AWS one from the context. I'm wondering if they are deploying additional data centers or something else?

JCM9, over 3 years ago
Cue the armchair infrastructure engineers.

The reality is that there's a handful of people in the world who can operate systems at this sheer scale and complexity, and I have mad respect for those in that camp.

Fordec, over 3 years ago
Between this and Log4j, I'm just glad it's Friday.

iwallace, over 3 years ago
My company uses AWS. We had significant degradation of many of their APIs for over six hours, which had a substantive impact on our business. The entire time, their outage board was solid green. We were in touch with their support people and knew it was bad, but were under NDA not to discuss it with anyone.

Of course problems and outages are going to happen, but saying they have five nines (99.999) of uptime as measured by their "green board" is meaningless. During the event they were late and reluctant to report it and its significance. My point is that they are wrongly incentivized to keep the board green at all costs.

betaby, over 3 years ago
The problem is that I have to defend our own infrastructure's real availability numbers against the cloud's fictional "five nines". It's a losing game.

yegle, over 3 years ago
Did this outage only impact the us-east-1 region? I think I saw other regions affected in some HN comments, but this summary did not mention anything to suggest more than one region was impacted.

AtlasBarfed, over 3 years ago
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries."

So was this in service to something like DynamoDB or some other service?

As in, did some of those extra services that AWS offers for lock-in (and that undermine open source projects with embrace and extend) bomb the mainline EC2 service?

Because this kind of smacks of the "Microsoft hidden APIs" that Office got to use against other competitors. Does AWS use "special hardware capabilities" to compete against other companies offering roughly the same service?

londons_explore, over 3 years ago
Idea: network devices should be configured to automatically prioritize the same packet flows for the same clients as they served yesterday.

So many overload issues seem to be caused by a single client, in a case where the right prioritization or rate-limit rule could have contained the outage, but such a rule either wasn't in place or wasn't the right one, due to the difficulty of knowing how to prioritize hundreds of clients.

Using more bandwidth or requests than yesterday should then be handled as capacity allows, possibly with a manually configured priority list, cap, or ratio. But "what I used yesterday" should always be served first. That way, any outage is contained to clients acting differently from yesterday, even if the config isn't perfect.

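A rough software-level sketch of that rule, with hypothetical names and an assumed yesterday_rates table of per-client request rates observed the previous day: each client is served up to its baseline first, and anything above the baseline only gets whatever spare capacity remains.

    import time

    class BaselinePrioritizer:
        """Admit each client up to yesterday's observed rate first; traffic
        above that baseline only gets whatever spare capacity remains."""

        def __init__(self, yesterday_rates, spare_capacity_per_sec):
            # yesterday_rates: client_id -> requests/sec observed yesterday
            self.rates = dict(yesterday_rates)
            self.spare_rate = spare_capacity_per_sec
            self.tokens = dict(self.rates)        # start with full buckets
            self.spare_tokens = spare_capacity_per_sec
            self.last = time.monotonic()

        def _refill(self):
            now = time.monotonic()
            elapsed, self.last = now - self.last, now
            for client, rate in self.rates.items():
                self.tokens[client] = min(rate, self.tokens[client] + rate * elapsed)
            self.spare_tokens = min(self.spare_rate,
                                    self.spare_tokens + self.spare_rate * elapsed)

        def admit(self, client_id) -> bool:
            self._refill()
            if self.tokens.get(client_id, 0.0) >= 1.0:
                self.tokens[client_id] -= 1.0     # within yesterday's baseline
                return True
            if self.spare_tokens >= 1.0:
                self.spare_tokens -= 1.0          # above baseline: best effort
                return True
            return False                          # shed load from the outliers
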
cyounkins, over 3 years ago
My favorite sentence: "Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."

jtchang, over 3 years ago
The complexity that AWS has to deal with is astounding. Sure, having your main production network and a management network is common. But making sure all of it scales and one doesn't bring down the other is what I think they are dealing with here.

It must have been crazy hard to troubleshoot when you are flying blind because all your monitoring is unresponsive. Clearly more isolation, with clearly delineated information exchange points, is needed.

moogly, over 3 years ago
Hm. This post does not seem to acknowledge what I saw: multiple hours of rate limiting kicking in when trying to talk to S3 (eu-west-1). After the incident, everything works fine without any remediation done on our end.

hourislate, over 3 years ago
Broadcast storm. Never easy to isolate; as a matter of fact, it's nightmarish...

sponaugle, over 3 years ago
"Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."

That is an interesting way to phrase it. A 'well-tested' method, but 'latent issues'. That would imply the 'well-tested' part was not as well tested as it needed to be. I guess 'latent issue' is the new 'bug'.

JCM9, over 3 years ago
Obviously one hopes these things don't happen, but that's an impressive and transparent write-up that came out quickly.

amznbyebyebye, over 3 years ago
I'm glad they published something, and so quickly at that. Ultimately these guys are running a business. There are other market alternatives, multibillion-dollar contracts at play, SLAs, etc. It's not as simple as people think.

onion2k, over 3 years ago
> This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams

I guess this is why it took ages for the status page to update. They didn't know which things to turn red.

StreamBright, over 3 years ago
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network."

Very detailed.

paulryanrogers, over 3 years ago
Has anyone been credited by AWS for violations of their SLAs?
revskill, over 3 years ago
Most rate-limiter systems simply drop invalid requests, which is not optimal as I see it.

A better way is to have two queues: one for valid messages and one for invalid messages.

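One way to read that suggestion, as a sketch rather than a description of any real limiter: requests the limiter rejects are parked in a secondary, lower-priority queue and drained when capacity frees up, instead of being dropped outright. The class and method names here are made up for illustration.

    from collections import deque

    class TwoQueueLimiter:
        """Admit up to `limit` requests per drain cycle; park the overflow in a
        low-priority queue instead of dropping it."""

        def __init__(self, limit: int):
            self.limit = limit
            self.admitted = deque()   # requests within the rate limit
            self.deferred = deque()   # requests over the limit, served later

        def submit(self, request) -> None:
            if len(self.admitted) < self.limit:
                self.admitted.append(request)
            else:
                self.deferred.append(request)

        def drain(self):
            """Take the admitted batch, then backfill from the deferred queue
            if there is room left in this cycle."""
            batch = list(self.admitted)
            self.admitted.clear()
            while self.deferred and len(batch) < self.limit:
                batch.append(self.deferred.popleft())
            return batch
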
whatever1, over 3 years ago
Noob question, but why does network infrastructure need DNS? Why don't the full IPv6 addresses of the various components suffice to do business?

User23, over 3 years ago
A packet storm outage? Now that brings back memories. Last time I saw that it was rendezvous misbehaving.
rodmena, over 3 years ago
Umm... but just one thing: S3 was not available for at least 20 minutes.

sneak, over 3 years ago
"Impact" occurs 27 times on this page.

What was wrong with "affect"?

stevefan1999, over 3 years ago
In a nutshell: thundering herd.
atoav, over 3 years ago
A "service event"?!

qwertyuiop_, over 3 years ago
House of cards
eigen-vector, over 3 years ago
Exceeded the character limit on the title so I couldn't include this detail there, but this is the post-mortem of the event on December 7, 2021.

bpodgursky, over 3 years ago
DNS?

Of course it was DNS.

It is always* DNS.

pinche_gazpacho, over 3 years ago
Yeah, the CloudWatch APIs went down the drain. Good for them for publishing this, at least.

foobarbecue, over 3 years ago
"... the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region. By 8:22 AM PST, we were successfully updating the Service Health Dashboard."

Sounds like they lost the ability to update the dashboard. HN comments at the time were theorizing it wasn't being updated due to bad policies (need CEO approval) etc. Didn't even occur to me that it might be stuck in green mode.

soheil, over 3 years ago
Having an internal network like this, on which everything on the main AWS network so heavily depends, is just bad design. One does not build a stable high-tech spacecraft and then fuel it with coal.

nayuki, over 3 years ago
> Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered.

Having DNS problems sounds a lot like the Facebook outage of 2021-10-04: https://en.wikipedia.org/wiki/2021_Facebook_outage
