Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently right now; reliability seems to have taken a hit.
If you haven't seen it yet, the news is that it was a power loss:<p>> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
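For anyone trying to act on the "fail away from USE1-AZ4" advice: the status post names a zone ID, which maps to a different zone name in each account. Here's a minimal boto3 sketch, not an official procedure; only the zone ID and region come from the status post, everything else is generic:<p><pre><code># Sketch: map the zone ID from the status post to this account's zone name,
# then list running instances there. Assumes default credentials are configured.
import boto3

AFFECTED_ZONE_ID = "use1-az4"  # from the AWS status update

ec2 = boto3.client("ec2", region_name="us-east-1")

# Zone IDs (use1-az4) are stable across accounts; zone names (us-east-1a, ...) are not.
zones = ec2.describe_availability_zones()["AvailabilityZones"]
zone_name = next(z["ZoneName"] for z in zones if z["ZoneId"].lower() == AFFECTED_ZONE_ID)

# Find running instances in the affected zone.
pages = ec2.get_paginator("describe_instances").paginate(
    Filters=[
        {"Name": "availability-zone", "Values": [zone_name]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], zone_name)</code></pre>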
I've built out many 42U racks in DCs in my time and there were a couple of rules that we never skipped:<p>1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source, meaning that we can lose a single power supply/circuit and nothing happens
2. Dual network (at minimum) - For the same reasons as above, since the switches didn't always have dual power in them.<p>I've only had a DC fail once, when an engineer performing work on the power circuitry thought he was taking down one circuit but was in fact on the wrong one and took both power circuits down at the same time.<p>However, a power cut (in the traditional sense, where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!<p>What am I missing?<p>I've never worked anywhere with Amazon's budget, so why are they not handling this? Is it more than just the incoming supply being down?
Every time a major cloud provider has an outage, infra people and execs cry foul and say we need to move to <the other one>. But does anyone really have an objective measure of how clouds stack up reliability-wise? I doubt it, since outages and their effects are nuanced. The other move is wanting to go multi-cloud... but I've been involved in enough multi-cloud initiatives to know how much time and effort they soak up, not to mention the overhead cost of maintaining two sets of infra sub-optimally. I would say that for most businesses, these costs far exceed the occasional six-hour outage.
Is there a history of AWS downtime available somewhere? This makes what, three times in as many months?<p>edit: The question isn't necessarily AWS-specific; just any data on the amount of downtime per cloud provider over time would be nice.
AWS didn’t “go down”. They had an outage in one AZ, which is why there are multiple AZs in each region. If your app went down, then you should be blaming your developers for this one, not AWS. Those having issues are discovering gaps in their HA designs.<p>Obviously it’s not good for an AZ to go down, but it does happen, which is why any production workload should be architected for seamless failover and recovery to other AZs, typically by just dropping nodes in the down AZ.<p>People commenting that servers shouldn’t go down etc. don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail... build architecture where that won’t take you down.
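To make the "drop nodes in the down AZ" part concrete, here's one minimal boto3 sketch of an Auto Scaling group spread across three AZs; the group name, launch template, subnet IDs, and target group ARN are placeholders, and with ELB health checks enabled the group replaces capacity in the healthy zones when one goes dark:<p><pre><code># Sketch: an Auto Scaling group spanning three AZs via one subnet per AZ.
# All names/IDs/ARNs below are placeholders for illustration only.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-fleet",
    LaunchTemplate={
        "LaunchTemplateName": "web-server",  # hypothetical launch template
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    DesiredCapacity=3,
    # One subnet per AZ; the group balances instances across them and
    # replaces capacity in the remaining zones if one becomes unhealthy.
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222,subnet-cccc3333",
    HealthCheckType="ELB",          # use load balancer health checks
    HealthCheckGracePeriod=300,
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/web/abc123"  # placeholder
    ],
)</code></pre>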
The prevailing wisdom throughout the last couple of years was:<p>“ditch your on-prem infrastructure and migrate to a major cloud provider”<p>And it’s starting to seem like it could be something like:<p>“ditch your on-prem infrastructure and spin up your own managed cloud”<p>This is probably untenable for larger orgs where convenience gets the blank-check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?
Slack seems to have some issues because of that - I'm not sure if anyone is receiving messages, as it has been completely silent for the last 15 minutes or so.
I guess that's why I'm experiencing weird issues with Heroku:<p><pre><code> remote: Compressing source files... done.
remote: Building source:
remote:
remote: ! Heroku Git error, please try again shortly.
remote: ! See http://status.heroku.com for current Heroku platform status.
remote: ! If the problem persists, please open a ticket
remote: ! on https://help.heroku.com/tickets/new</code></pre>
5ish years ago it was common knowledge that us-east-1 is generally the worst place to put anything that needs to be reliable. I guess this is still true?
One of our EC2 instances in us-east-1c is unavailable and stuck in "stopping" state after a force stop.
Interestingly enough, EC2 instances in us-east-1b don't seem to be affected.<p>The console is throwing errors from time to time. As usual, no information on the AWS status page.
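When the console itself is flapping, the API often still answers; a small sketch that polls EC2's own status checks directly (assumes default credentials and the us-east-1 region):<p><pre><code># Sketch: read EC2 status checks via the API instead of relying on the console.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

statuses = ec2.describe_instance_status(IncludeAllInstances=True)["InstanceStatuses"]
for s in statuses:
    print(
        s["InstanceId"],
        s["AvailabilityZone"],
        s["InstanceState"]["Name"],     # e.g. running, stopping
        s["SystemStatus"]["Status"],    # underlying host/power/network checks
        s["InstanceStatus"]["Status"],  # the instance's own checks
    )</code></pre>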
Now that everyone and their dog is on AWS, it's not just 'a website stops working'; half the world, from telephones to security doors and IoT equipment, stops working.<p>I am not sure if the move to the cloud has reduced the number of failures, but it has definitely made these failures more catastrophic.<p>Our profession is busy making the world less reliable and more fragile; we will have our reckoning just like the shipping industry did.
4:35 AM PST We are investigating increased EC2 launch failures and networking connectivity issues for some instances in a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. Other Availability Zones within the US-EAST-1 Region are not affected by this issue.<p>via <a href="https://stop.lying.cloud/" rel="nofollow">https://stop.lying.cloud/</a>
It seems that it's due to a power loss.<p>[05:01 AM PST] We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
Fields of green here <a href="https://status.aws.amazon.com/" rel="nofollow">https://status.aws.amazon.com/</a>
Anyway, I can access the web console with no issues (eu-west).
I do wonder if the Great Resignation has anything to do with this. My team (no affiliation with Amazon) was cut in half from last year and we are struggling to keep up with all the work.
Invision image uploads are down too because of this: <a href="https://status.invisionapp.com/" rel="nofollow">https://status.invisionapp.com/</a>
Question to the sysadmins here: Is it really that outrageous for Amazon to have such issues, or are people way too spoiled to appreciate the effort that goes into maintaining such a service?<p>Edit: Not supporting Amazon, I generally dislike the company. I just don't understand the extent to which the criticism is justified.
So why are people not migrating out of us-east-1? Operating in ap-southeast, we weren't that affected by the us-east-1 downtime, although our system is reasonably static and doesn't make lots of IAM calls (which seems to be a large SPOF served from us-east-1).
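On that SPOF point: as I understand it, the global STS endpoint (sts.amazonaws.com) is served out of us-east-1, and one commonly cited mitigation is opting into regional STS endpoints. A hedged boto3 sketch, where the region is just an example:<p><pre><code># Sketch: prefer a regional STS endpoint over the global one (which is
# served from us-east-1). The region below is only an example.
import boto3

sts = boto3.client(
    "sts",
    region_name="ap-southeast-1",
    endpoint_url="https://sts.ap-southeast-1.amazonaws.com",
)
print(sts.get_caller_identity()["Account"])

# Alternatively, the SDK/CLI can be told to do this globally, e.g. via the
# environment variable AWS_STS_REGIONAL_ENDPOINTS=regional (or the
# sts_regional_endpoints setting in ~/.aws/config).</code></pre>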
Bitbucket is down as well because of this. <a href="https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g5" rel="nofollow">https://bitbucket.status.atlassian.com/incidents/r8kyb5w606g...</a>
My Elastic Beanstalk instances are completely unreachable. Seems like at the very least ELB is down. Looking at Downdetector, this is taking a bunch of sites down with it. As usual, the AWS status page shows all green.
As an industry, can we please stop making products like vacuums that can't operate unless someone else's computer is working in a field in Virginia? There's literally no reason for it.
Looks like the SEC's Edgar website is affected. This is the site the SEC uses to post the filings of public companies. Normally there are a hundred or more company filings in the morning starting at 6am ET. This morning there are two.<p><a href="https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent" rel="nofollow">https://www.sec.gov/cgi-bin/browse-edgar?action=getcurrent</a>
Hubspot seems to be down too [0].<p>[0] <a href="https://status.hubspot.com/" rel="nofollow">https://status.hubspot.com/</a>
As far as I understood, a whole availability zone went down; today is also the day a lot of people learn why "multi-AZ" matters, so I don't think it's fair to say that services are down because the whole of AWS is down.
Where are you located? "X is down" without a location is only moderately useful.<p>I'm having issues with Slack from central EU (Poland) -- can't upload images or send emoji reactions to posts; curiously, text works fine. Wondering if it's linked.
Here is The Internet Report episode on the topic of recent AWS outages, covering the outage and root causes: <a href="https://youtu.be/N68pQy8r1DI" rel="nofollow">https://youtu.be/N68pQy8r1DI</a>
2 of our servers are fucked right now. VOIP services down.<p>Only with AWS and GitHub do I seem to get panicked text messages on my phone first thing in the morning... Our workloads on Azure typically only have faults when everyone is in bed.
We'll never really know the answer, but I have to wonder what percentage of comments on this thread are from Amazon downplaying the severity & other cloud providers hyping it up.
I also had problems loading YouTube at the same time (for 10-15 minutes). It looks like a coincidence, but who knows if Google uses some of the infrastructure from AWS.
I used to think it was silly to have your own hardware (like a NAS) in your house. What makes you think you can do it better than AWS?<p>Santa is bringing me a Synology in three days.
Assuming crates.io is AWS-backed? Getting a fun situation where direct dependencies of an application download but then the sub-dependencies don't.
Of all the AWS outages, my team and I have dodged them all, except this one. 3 instances down and unavailable.<p>> Due to this degradation your instance could already be unreachable<p>>:(
Can we please stop saying, “AWS is down”?<p>AWS consists of over 200 services offered across 86 availability zones in 26 regions, each with its own availability.<p>If one service in one availability zone being impaired equals a post about “AWS is down”, we might as well auto-post that every day.
Isn't the point of an availability zone having multiple data centers that if a single data center in the zone fails, services aren't affected?
Me: <i>Hesitation at last job about moving absolutely everything (including backups) to AWS, because if it goes down it's a problem.</i> I'm a firm believer in <i>some kind of</i> physical/easily accessible backup.<p>Coworkers: "You're an f'n idiot. Amazon and Facebook don't go down, you're holding us back!" <- Quite literally their words.<p>Me: <i>leaves because that treatment was the final straw</i><p>Amazon and Facebook both go down within a month of each other, and suddenly they needed backups.<p>Them: <i>shocked pikachu face</i>