AWS Cognito is having issues and health dashboards are still green

492 points by rcardo11 over 4 years ago

61 comments

PragmaticPulp over 4 years ago

We hired an engineer out of Amazon AWS at a previous company.

Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.

After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.

FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.

piewzko over 4 years ago

Now is probably a good time to plug some of the open source alternatives to vendor-locked identity solutions:

- https://github.com/ory
- https://github.com/dexidp/dex
- https://github.com/authelia/authelia
- https://github.com/keycloak/keycloak
- https://www.gluu.org/
- https://github.com/accounts-js/accounts

rcardo11 over 4 years ago

> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.

Well, this is a major outage.

driverdan over 4 years ago

Five hours later and nothing has changed. For a company like Amazon this should be unacceptable.

Before someone replies and says use a different AZ, that's not possible for everyone. If you use a 3rd party service that is hosted on us-east-1 you can't do anything about it. For example, many Heroku services are broken because of this.

bithavoc over 4 years ago

"I want to have an AWS region where everything breaks with high frequency..." [0], discussed here [1].

[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20

[1] https://news.ycombinator.com/item?id=24103746

turdnagel over 4 years ago

Isn't it common practice to host your status board on someone else's infrastructure?

In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.

throwaway343432 over 4 years ago

Large-scale events (LSEs) are becoming more and more common. It'll keep getting worse.

AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.

AWS will be in deep trouble when/if GCE fixes their customer support.

s_dev over 4 years ago

Can anyone explain why status pages are so difficult? There are even startups like status.io dedicated to this one thing.

It really does seem that anytime there is an outage, more often than not the status page is showing all green traffic lights, making it redundant as a tool to corroborate what's happening.

How did the AWS status page compare with status.io/aws?

zxcvbn4038 over 4 years ago

I think we are learning everything that uses AWS Kinesis internally which is cool. It’s always fascinating to learn how AWS works on the backend.

drfritznunkie over 4 years ago

Cognito is one of the most frustrating AWS services I have to work with; it is almost, but not quite, entirely unlike an SP.

We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.

Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are I think possible with ADFS (not that I've tested it), I'm pretty sure that Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...

_0o6v over 4 years ago

> It's not posted on SHD as the issue has impacted our ability to post there.

Is that not a massive catch-22 for a service dashboard?

0xmohit over 4 years ago

Almost 9 years have passed by and nothing has changed. The dashboards continue to remain green.

https://news.ycombinator.com/item?id=3707590

camhart over 4 years ago

"This issue has also affected our ability to post updates to the Service Health Dashboard."

Last sentence of the alert at the top of the page.

LeoTM over 4 years ago

I'm in the UK and this may now have cascaded onto VISA: https://downdetector.co.uk/status/visa/map/

I am unable to order my Papa Johns pizza: https://imgur.com/u5QSszv

vishesh92 over 4 years ago

> "This issue has also affected our ability to post updates to the Service Health Dashboard."

This is why I prefer 3rd party monitoring systems to track the health of my internal monitoring systems.

Steve886 over 4 years ago

Many applications, including Anchor, Adobe Spark, Flickr, SiriusXM and Roku, reported disruption caused by this outage. https://news.alphastreet.com/huge-aws-outage-affects-a-wide-range-of-applications/

pluc over 4 years ago

Rule #1 of status pages: never put your status page on the same infrastructure it monitors.
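
A minimal sketch of that rule in Python, assuming the probe runs from a host outside the provider being watched. The names `probe` and `classify` and the 50% threshold are illustrative assumptions, not anything AWS or a status-page vendor is known to use.

```python
import urllib.error
import urllib.request
from typing import Optional, Sequence


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None on timeout/connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except OSError:
        # URLError and socket timeouts are both OSError subclasses.
        return None


def classify(codes: Sequence[Optional[int]]) -> str:
    """Map recent probe results to a dashboard colour.

    A result counts as a failure if it is None (no response) or a 5xx code.
    The 50% cut-off between yellow and red is an arbitrary assumption.
    """
    if not codes:
        return "unknown"
    failures = sum(1 for code in codes if code is None or code >= 500)
    ratio = failures / len(codes)
    if ratio == 0:
        return "green"
    if ratio < 0.5:
        return "yellow"
    return "red"
```

The point of keeping `classify` separate from `probe` is that the colour is derived only from what an outside client actually saw, so it cannot stay green because the monitored platform failed to publish an update.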

dsagal over 4 years ago

Banner on top of https://status.aws.amazon.com/ just has an update from 8:36AM PST -- just removed -- even though it's only 7:42AM PST. I guess it's really manual firefighting there.

unilynx over 4 years ago

There's a lot more going on over there...

- 7 CloudFront distributions created today are still in "InProgress", a few already for more than one hour
- The support case I created about it doesn't show up in my support portal. A direct link to it does work, though

swasheck over 4 years ago

Ah yes. It's the annual AWS Thanksgiving Holiday major us-east outage.

dubcanada over 4 years ago

> 7:30 AM PST: We are currently blue on Kinesis, Cognito, IoT Core, EventBridge and CloudWatch given an increase in errors for Kinesis in the US-EAST-1 Region. It's not posted on SHD as the issue has impacted our ability to post there. We will update this banner if there continue to be issues with the SHD.

Was posted 8 minutes ago.

mcintyre1994 over 4 years ago

> 2:43 PM PST Between 5:15 AM and 2:28 PM PST customers experienced increased API failure rates for Cognito User Pools and Identity Pools in the US-EAST-1 Region. This was due to an issue with Kinesis Data Streams. We have implemented a mitigation to this issue. Cognito is now operating normally.

Seems like they fixed Cognito while Kinesis and many other services are still broken, presumably by somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.

reese_john over 4 years ago

Kinesis seems to be down for me. Everything is melting; it is like they have Chaos Monkey perpetually on in us-east-1.

zedpm over 4 years ago

Every time I check the Personal Health Dashboard, the number of issues increases; it's currently showing 13 open issues for my account. Cloudwatch logs for the last few hours are unavailable; it appears that the log agent is getting errors when it attempts to upload log events. Metrics are spotty or missing.

tysoncadenhead over 4 years ago

Maybe AWS should put their dashboards on GCP

rmujica over 4 years ago

It is now affecting ECS and EKS. Having problems scaling own nodes.

MR4D over 4 years ago

And that confirms it for me: Amazon is officially a Day 2 Company.

Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.

kristianpaul over 4 years ago

CloudWatch is definitely one of those "AWS Primitives" services that has side effects on others when it has problems; something similar happened with DynamoDB some years ago.

xyst over 4 years ago

just looking at this dashboard, I never realized how many services aws has to offer. I’d hate to be the “aws” guy

leothekim over 4 years ago

> This issue has also affected our ability to post updates to the Service Health Dashboard.

This is when you fall back to the Tumblr blog for status updates.

<rimshot>

mushufasa over 4 years ago

What a scam. Who can hold them accountable for cheating those who paid for uptime guarantees?

I guess the lawyers of those who paid for uptime guarantees...

dbenny over 4 years ago

Apparently they can't update the status page because of the outage. This happened a few years ago with the massive s3 outage.

carusooneliner over 4 years ago

We are seeing an elevated rate of failures on our service, which depends on AWS Cognito. Tweeted an update on it: https://twitter.com/outklip/status/1331705524396625924

btown over 4 years ago

As of 2020-11-25T17:21Z this is also causing a Heroku outage preventing new spin-ups, which presumably uses these APIs to verify instance health. https://status.heroku.com/

doseofreality over 4 years ago

friends don’t let friends use us-east-1.

symlinkk over 4 years ago

Title should be changed, this is a widespread AWS issue, it’s not specific to Cognito.

snvzz over 4 years ago

This is what's called a SNAFU.

LeoTM over 4 years ago

Experiencing 504's from Cognito too, our users can't log in.

"amazon-cognito-identity-js": "^3.2.2"
"aws-amplify": "^2.2.2"

simlevesque over 4 years ago

I'm getting tired of that bullshit. Just admit it.

skavish over 4 years ago

MediaConvert just stopped processing our queues two hours ago, in all our accounts. Is anybody else seeing this? It's green on the status board.

bengalister over 4 years ago

Same issue with AWS Lambda. I got: Received malformed response from transform AWS::Serverless-2016-10-31.

It is now reported in their service health dashboard.

agustif over 4 years ago

We've been having lots of issues with Vercel today. Since it uses AWS under the hood, I'm guessing that's related...

astatine over 4 years ago

The way ahead should be an independent entity who audits systems and has responsibility to certify that the dashboard represents a true and accurate view of the actual status. Like is done so effectively with company financials.

Oh, wait! EY, PWC, and who can forget Arthur Andersen!

But, naturally, technology people can solve this better than anyone else, right?

arusahni over 4 years ago

All my CloudWatch alerts are firing "OK" transitions, and AWS ES isn't displaying any known instances

garymoon over 4 years ago

We experienced 504 errors from Cognito, but it seems that other services are affected as well.

Nexeo over 4 years ago

Is anyone else getting "Capacity unavailable" when trying to add tasks in Fargate?

sk5t over 4 years ago

EventBridge has been struggling for about the past 14 hours as well, which means Cloudwatch Events is not too happy; and, I have the impression CWE underpins a surprising diversity of other things at AWS.

troelsSteegin over 4 years ago

Would this explain the washingtonpost.com outage? That site has been displaying a "Welcome to OpenResty!" page for the past 20 minutes or so.

EDIT: never mind, the Post is back, and Kinesis is still erroring.

edoceo over 4 years ago

Also this thread: https://news.ycombinator.com/item?id=25209508

TedShiller over 4 years ago

AWS Status website is down for me.

Is there a status website for AWS Status?

revicon over 4 years ago

I’m trying to find a doc on running cognito using multiple zones and I can’t find much. Anyone have a multi-az cognito deployment running right now?

tootie over 4 years ago

Any tips on how to collect on SLA credits from this?

Erlangen over 4 years ago

Is this the reason I have seen connection errors in Duolingo?

> upstream connect error or disconnect/reset before headers. reset reason: overflow

dtjones over 4 years ago

We're getting 504 for our well-known jwks file

And request timeouts against cognito-idp.us-east-1.amazonaws.com

And the Cognito console won't load

tibbar over 4 years ago

503s from CloudWatch for us.

outworlder over 4 years ago

They should rename that region to us-chaostesting-1. Problem solved.

totaldude87 over 4 years ago

I tried reaching out to Amazon support; apparently they are also seeing issues internally, and there is a high possibility that these two are related.

Their ETA: 2 hours, and then try contacting again!

mikece over 4 years ago

Is this only affecting us-east-1 or other regions as well?

jadbox over 4 years ago

Having Lambda issues too

ssss11 over 4 years ago

There is no problem here. jedi mind trick hand wave

shripadk over 4 years ago

Paddle checkout is down as well (connected to this outage): https://twitter.com/PaddleHQ/status/1331659286649466881
