My guess is this is all due to CloudWatch Logs PutLogEvents failures.

By default, a Docker container configured with the awslogs driver runs in "blocking" mode. As the container writes logs, Docker buffers them and pushes them to CloudWatch Logs frequently. If the container logs faster than the buffer can drain, writes to stdout/stderr block and the container freezes on the logging call. If PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume most of AWS uses its own logging system, which could cause these large, intermittent failures.

If you're okay dropping logs, add something like this to the container logging definition:

    "max-buffer-size": "25m",
    "mode": "non-blocking"
It seems to have cascaded from AWS Kinesis...

[03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.

39 affected services listed:

AWS Application Migration Service
AWS Cloud9
AWS CloudShell
AWS CloudTrail
AWS CodeBuild
AWS DataSync
AWS Elemental
AWS Glue
AWS IAM Identity Center
AWS Identity and Access Management
AWS IoT Analytics
AWS IoT Device Defender
AWS IoT Device Management
AWS IoT Events
AWS IoT SiteWise
AWS IoT TwinMaker
AWS License Manager
AWS Organizations
AWS Step Functions
AWS Transfer Family
Amazon API Gateway
Amazon AppStream 2.0
Amazon CloudSearch
Amazon CloudWatch
Amazon Connect
Amazon EMR Serverless
Amazon Elastic Container Service
Amazon Kinesis Analytics
Amazon Kinesis Data Streams
Amazon Kinesis Firehose
Amazon Location Service
Amazon Managed Grafana
Amazon Managed Service for Prometheus
Amazon Managed Workflows for Apache Airflow
Amazon OpenSearch Service
Amazon Redshift
Amazon Simple Queue Service
Amazon Simple Storage Service
Amazon WorkSpaces
This is a bigger deal than the 'degraded' status implies. SQS reads have basically ground to a halt, which is causing massive slowdowns where I am, and the logging issues are causing task timeouts.
Our accounting system Xero is down, with a reference to AWS on their status page. Related to this, I assume.

https://status.xero.com/