I feel for the OP, I really do. Have been through this many times before.<p>But it is going too far to call this a "lie" on the part of AWS. If you talk to any AWS customer support representative, or read their docs, they are very clear that the unit of availability for their service is a <i>region</i>, not a <i>zone</i>.<p>Notice that the status update said "connectivity issues in a single Availability Zone".<p>If you are deploying in a single AZ, and your app is not tolerant of at least one AZ failure, then you should not be telling your customers/boss that your app is highly available. That's not the fault of AWS; it's how you're using AWS.<p>With that said, I do think that AWS could do two things: 1) maybe show a yellow warning sign instead of a blue "i" for something that borks an entire AZ, and 2) make it even clearer that a single AZ is not highly available.
I have worked at Amazon, and I can validate this. When I was on an AWS team, posting a "non-green" status to the status page was actually a manager's decision. That means it is in no way real time, and it's possible the status says "everything is OK" when it really isn't, because a manager doesn't think it's a big enough deal.
There is also a status called green-i, which is the green icon with a little 'i' information bubble. Almost every time something was seriously wrong, our status was green-i, because the "level" of warning is ultimately a manager's decision, and managers avoid yellow and red as much as possible. So the statuses aren't very accurate.
That being said, if you do see a yellow or red status on AWS, you know things are REALLY bad.
Note that if they did post a non-green status, they might have to credit you under their SLAs. There's likely a strong disincentive internally to let a problem cause a non-green status, because it is a metric that costs Amazon money: Amazon offers service credits for reduced uptime. As Amazon employees are known for being judged very harshly on metrics, lying on the status page is thus incentivized.<p>You are looking at the consequences of Amazon's employee culture. "Metrics or die" means the metrics might need to be fudged to keep your and your team's jobs.
AWS is massively incentivized to lie on the status pages until the outages become egregious. It's literally a page for marketing to point to when describing how reliable the service is.
I agree that AWS could improve their openness around service problems. That said, what impacts some customers doesn't impact all customers. Ably may be offline when AWS has certain types of trouble, but that doesn't mean everyone with services in that AZ or region is having similar problems.<p>You also have to keep in mind that AWS status reports aren't just informational. They have a direct impact on how people use the service. As soon as AWS puts up a yellow dot, people start changing DNS and spinning up failovers in other regions or AZs, which has the potential to cascade into a much bigger failure when resources are unnecessarily tied up in other regions for something that maybe isn't as big a crisis as it sounds at first. This has happened before, and it is part of the reason why AWS is so circumspect about announcing problems.<p>Point being that there are ways to architect around AZ and region failure that don't rely on trusting AWS's own reporting. Anyone with a significant investment in or dependency on AWS or any cloud service should not rely on that service's own indicators.<p>All that said, the fact is that in a system as big as AWS there are minor problems going on all the time. It would be healthier for their customers, their PR image, and the response to bigger incidents to reveal more detail showing that, yes, most of the time some tiny fraction of the system is broken, offline, in a failure state, or suffering some other unknown issue. 500 errors happen with the API sometimes; let's see streaming feeds of failure rates from however many endpoints there are. EC2 instances have hardware failures or need to reboot sometimes; let's see some indication of whether that's happening more or less often than normal. Network segments fail or get overloaded... Revealing some of the less pristine guts of the operation (without revealing sensitive details, of course) would go a long way toward being more honest, without the bigger risks of saying EC2 in us-east-1 is down!
I learned long ago that the AWS status page is useless: it usually only updates 30-40 minutes after the problem starts (by which time you've already noticed it), and they will always minimize the problem on the status page...<p>It's frustrating when there's an issue with AWS and my clients tell me that their status page reports that everything is OK.
I've had this happen to me multiple times. Luckily I use AWS for batch research jobs, so I'm not losing any money when things break.<p>But when researchers come to me with AWS problems even though I haven't changed anything, I usually spend 10 minutes on one of the nodes doing some cursory investigation, and if it's not obvious what the problem is, I tell them to wait a day or two before investigating further. 95% of the time, the problem clears up on its own.<p>I've learned the hard way that spending half a day tracing SQS messages, DynamoDB tables, spot bidding, etc. is usually a waste of time.<p>That the AWS console flat-out lies like this wastes so much of everyone's time. It's not even hard to fix: AWS could run internal test cases on every subnet group.
A while back, after some months of frustration with significant levels of S3 API failures in bulk that never made it to the status page, I ended up writing a tool that continually monitors S3 by making random requests and alerts if too many fail per unit time. The outcome of that experiment was the finding that there are a lot of significant spikes in S3 API failure rates that go entirely unannounced by Amazon, though the situation has improved considerably in the past year.
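For anyone who wants to roll something similar, a stripped-down sketch of that kind of probe is below. The bucket name, interval, and threshold are placeholders rather than the values the real tool used, and the alerting here is just a print statement.<p><pre><code>
# Minimal sketch of an S3 availability probe (boto3 credentials assumed configured).
# BUCKET, PROBE_INTERVAL, WINDOW, and ALERT_THRESHOLD are placeholders.
import random
import time

import boto3
from botocore.exceptions import BotoCoreError, ClientError

BUCKET = "my-canary-bucket"   # hypothetical bucket used only for probing
PROBE_INTERVAL = 5            # seconds between probes
WINDOW = 60                   # rolling window size in seconds
ALERT_THRESHOLD = 0.10        # alert if more than 10% of probes fail in the window

s3 = boto3.client("s3")
results = []                  # list of (timestamp, succeeded) tuples

def probe():
    """Issue one random read against the canary bucket and report success/failure."""
    key = f"canary/{random.randint(0, 999)}"
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as err:
        # A 404 means S3 answered correctly; only count 5xx-style responses as failures.
        code = err.response.get("ResponseMetadata", {}).get("HTTPStatusCode", 500)
        return code < 500
    except BotoCoreError:
        return False          # connection/timeout problems count as failures

while True:
    now = time.time()
    results.append((now, probe()))
    results = [(t, ok) for t, ok in results if now - t <= WINDOW]
    failures = sum(1 for _, ok in results if not ok)
    if results and failures / len(results) > ALERT_THRESHOLD:
        print(f"ALERT: {failures}/{len(results)} S3 probes failed in the last {WINDOW}s")
    time.sleep(PROBE_INTERVAL)
</code></pre>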
AWS status page is a point of contention within the support teams as well. I've contacted them for problems when the status page says "everything's fine" and they have known outages. There is ALWAYS a lag between them knowing and anything being reported, usually on the order of 15-30 minutes. Most of the time their response is "Yes, there's a problem. I don't know why the status page isn't updated yet, it would really help us (support) out if the page was accurate."<p>Twitter is usually the best place to watch if you think it's not you, search #aws or #{service_name} and view the "Live" tab to see if others are reporting the same trouble.<p>Calling attention to these failures via support ticket, via sales rep and even publicly via @ customer support heads has made zero difference. Here's hoping this blog has enough visibility to make a difference.
The "real AWS status" extension for Chrome [1] pokes a little bit of fun at this. Basically, it hides all the regular green checkmarks (unless they are a 'Resolved' issue), then it increments the status images (green with I becomes yellow, yellow becomes red). It also makes the legend at the bottom snarky.<p>Although it's really meant as a joke, I think, it's actually really useful. It removes the noise and, frankly, the incremented status images are usually quite accurate for us.<p>[1] <a href="https://github.com/josegonzalez/real-aws-status" rel="nofollow">https://github.com/josegonzalez/real-aws-status</a>
Related to this:<p>My favorite is "We are experiencing an elevated level of errors," which almost always translates to "This service is completely down!"<p>Although, with that said, they are not lying: 100% failure is indeed "elevated" relative to 0%.
This is definitely not unique to AWS; in my own experience it applies to GCP as well. We sometimes had hundreds of Sentry alerts for DB connection errors from the web app within a five-minute window, and the GCP dashboard would still show green the entire time. Disclaimer: I have worked with GCP for 2 years and AWS for 8+ years.
Damn lies about the number of nines. I face similar issues with the Google Cloud APIs on a daily basis, and their standard response is to go buy the next support level before they can look at the problem.<p>These companies claim to have state-of-the-art technology, whereas in reality their customers are the ones informing them about outages and performance problems.
The exhausting reality is that in the age of widely distributed services, everything is always broken enough that someone's use case is going to choke and die. When the best case involves total failure for a small percentage of people, what does green even mean? The author had their own monitoring; that's the only workable approach.
This literally happened to me for the first time this week, after years with AWS! My site was totally down for <i>6 hours</i>, but AWS still reported everything as green and no notifications were sent.<p>Any suggestions for a very simple third-party "check" that will constantly monitor my site and alert me (email/text) when it's unresponsive?
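One low-tech option, if you don't mind rolling your own: a cron job on a small box outside AWS that fetches the site and emails on failure. A minimal sketch is below; SITE_URL and the mail settings are placeholders, and texts can be reached through a carrier's email-to-SMS gateway.<p><pre><code>
# Minimal uptime probe, meant to run from cron on a machine outside AWS.
# SITE_URL, SMTP_HOST, FROM_ADDR, and TO_ADDR are placeholders to fill in.
import smtplib
from email.message import EmailMessage

import requests

SITE_URL = "https://example.com/health"
SMTP_HOST = "localhost"
FROM_ADDR = "uptime@example.com"
TO_ADDR = "you@example.com"   # or a carrier email-to-SMS address for text alerts

def site_is_up():
    """Return True if the site answers with a non-5xx status within 10 seconds."""
    try:
        resp = requests.get(SITE_URL, timeout=10)
        return resp.status_code < 500
    except requests.RequestException:
        return False

def send_alert():
    """Send a short email saying the site failed its health check."""
    msg = EmailMessage()
    msg["Subject"] = f"DOWN: {SITE_URL} is unresponsive"
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    msg.set_content(f"{SITE_URL} failed its health check.")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if not site_is_up():
        send_alert()
</code></pre>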
I wish AWS had both high-level and drill down status pages where it would post API success rates or the equivalent detailed information like github's status page has [0].<p>[0] <a href="https://status.github.com/" rel="nofollow">https://status.github.com/</a>
Short version: AWS doesn't give any real service status.<p>They have a status page, but everything is always "green". It would take half the internet going down before they move something to "not green (but not red either)".
A while ago I came up with this handy status page.<p><a href="http://i.imgur.com/adl7Yc3.png" rel="nofollow">http://i.imgur.com/adl7Yc3.png</a><p>It should work in 100% of cases - you can just serve it statically.
This is unfortunately not a new issue: <a href="https://github.com/josegonzalez/real-aws-status" rel="nofollow">https://github.com/josegonzalez/real-aws-status</a>
My coworker has worked a lot with AWS:<p>"When Armageddon strikes and all of humanity is extinct, the last one to survive has to switch the AWS status flag to 'Red'"