> <i>At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.</i><p>It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
This part is also interesting:<p><i>> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.</i><p>These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:<p><i>> We build our systems with the assumption that things will occasionally fail</i><p>Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
> <i>From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.</i><p>Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
"I did."<p>That was CEO Robert Allen's response when the AT&T network collapsed [1] on January 15, 1990<p>He was asked who made the mistake.<p>I can't imagine any CEO now a days making a similar statement.<p>[1] <a href="http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html" rel="nofollow">http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...</a>
Not as interesting an explanation as I was hoping for. Someone accidentally typed "delete 100 nodes" instead of "delete 10 nodes" or something.<p>It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."<p>At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.
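A minimal sketch of the kind of safeguard that passage describes: refuse a removal that would take a subsystem below its floor, and drain in small batches rather than all at once. The subsystem names, minimums, and batch size here are invented for illustration; this is not Amazon's actual tooling.

```python
# Hypothetical capacity-removal guard (illustrative names and numbers).
MIN_CAPACITY = {"index": 300, "placement": 150}  # assumed per-subsystem floors

def plan_capacity_removal(subsystem: str, current: int, to_remove: int,
                          max_batch: int = 5) -> list[int]:
    """Return the batch sizes to drain, or raise if the request is unsafe."""
    if current - to_remove < MIN_CAPACITY[subsystem]:
        raise ValueError(
            f"refusing: {subsystem} would drop below its minimum of "
            f"{MIN_CAPACITY[subsystem]} hosts"
        )
    # Remove capacity slowly, a few hosts at a time, instead of in one shot.
    batches, remaining = [], to_remove
    while remaining > 0:
        step = min(max_batch, remaining)
        batches.append(step)
        remaining -= step
    return batches
```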
Take a moment to look at the construction of this report.<p>There is no easily readable timeline. It is not discoverable from anywhere outside of social media or directly searching for it. As far as I know, customers were not emailed about this - I certainly wasn't.<p>You're an important business, AWS. Burying outage retrospectives and live service health data is what I expect from a much smaller shop, not the leader in cloud computing. We should all demand better.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.<p>I find making errors on production when you think you're on staging is a big source of mistakes like this. One of the best things I ever did at one job was to change the deployment script so that when you deployed you would get a prompt saying "Are you sure you want to deploy to production? Type 'production' to confirm". This stopped several "oh my god, no!" situations where you repeated previous commands without thinking. For cases where you need to use SSH as well (best avoided but not always practical), it helps to use different colours, login banners and prompts for the terminals.
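The prompt described above is only a few lines; a rough sketch, with the environment name and wording as examples rather than any particular team's script:

```python
import sys

def confirm_environment(target: str) -> None:
    """Force the operator to type the environment name before a risky deploy."""
    if target != "production":
        return  # no extra friction for staging or dev
    answer = input("Are you sure you want to deploy to production? "
                   "Type 'production' to confirm: ")
    if answer.strip() != "production":
        print("Aborting deploy.")
        sys.exit(1)

confirm_environment("production")  # exits unless 'production' is typed back
```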
" we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected"<p>This is analogous to "we needed to fsck, and nobody realized how long that would take".
TLDR; Someone on the team ran a command by mistake that took everything down. Good, detailed description. It happens. Out of all of Amazon's offerings, I still love S3 the most.
"we have changed the SHD administration console to run across multiple AWS regions."<p>Dear Amazon: please lease a $25/month dedicated server to host your status page on.
AWS partitions its services into isolated regions. This is great for reducing blast radius. Unfortunately, us-east-1 has many times more load than any other region. This means that scaling problems hit us-east-1 before any other region, and affect the largest slice of customers.<p>The lesson is that partitioning your service into isolated regions is not enough. You need to partition your load evenly, too. I can think of several ways to accomplish this:<p>1. Adjust pricing to incentivize customers to move load away from overloaded regions. Amazon has historically done the opposite of this by offering cheaper prices in us-east-1.<p>2. Calculate a good default region for each customer and show that in all documentation, in the AWS console, and in code examples.<p>3. Provide tools to help customers choose the right region for their service. Example: <a href="http://www.cloudping.info/" rel="nofollow">http://www.cloudping.info/</a> (shameless plug).<p>4. Split the large regions into isolated partitions and allocate customers evenly across them. For example, split us-east-1 into 10 different isolated partitions. Each customer is assigned to a particular partition when they create their account. When they use services, they will use the instances of the services from their assigned partition.
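To illustrate option 4, a sketch of cell-style assignment; the partition count and naming scheme are made up for illustration, not how AWS actually allocates accounts:

```python
import hashlib

NUM_PARTITIONS = 10  # e.g. split us-east-1 into 10 isolated cells

def assign_partition(account_id: str) -> str:
    """Deterministically pin an account to one partition at signup."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    cell = int(digest, 16) % NUM_PARTITIONS
    return f"us-east-1-cell-{cell}"

# Every request from the account is routed to its cell's service instances,
# so an operational mistake in one cell leaves the other nine untouched.
print(assign_partition("123456789012"))
```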
So this is the second high-profile outage in the last month caused by a simple command-line mistake.<p>> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.<p>If I had to guess which company could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.<p>While I hope that AWS would be as understanding as Gitlab, I doubt the outcome will be the same.
tl;dr: Engineer fat-fingered a command and shut everything down. Booting it back up took a long time. Then the backlog was huge, so getting back to normal took even longer. We made the command safer, and are gonna make stuff boot faster. Finally, we couldn’t report any of this on the service status dashboard, because we’re idiots, and the dashboard runs on AWS.
Overall, it's pretty amazing that the recovery was as fast as it was. Given the throughput of S3 API calls you can imagine the kind of capacity that's needed to do a full stop followed by a full start. Cold-starting a service when it has heavy traffic immediately pouring into it can be a nightmare.<p>It'd be very interesting to know what kind of tech they use at AWS to throttle or do circuit breaking to allow back-end services like the indexer to come up in a manageable way.
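We don't know what AWS actually uses internally, but one common shape for this is an admission limit that ramps up slowly after a cold start. A generic sketch of that idea, not their implementation:

```python
import time

class RampUpThrottle:
    """Admit a slowly growing request rate after a cold start, so a recovering
    backend is not hit with its full steady-state traffic all at once."""

    def __init__(self, start_rps: float = 10.0, max_rps: float = 10000.0,
                 ramp_seconds: float = 600.0):
        self.start = time.monotonic()
        self.start_rps, self.max_rps = start_rps, max_rps
        self.ramp_seconds = ramp_seconds
        self.allowance = start_rps
        self.last = self.start

    def current_limit(self) -> float:
        # Linearly ramp from start_rps to max_rps over ramp_seconds.
        frac = min(1.0, (time.monotonic() - self.start) / self.ramp_seconds)
        return self.start_rps + frac * (self.max_rps - self.start_rps)

    def allow(self) -> bool:
        now = time.monotonic()
        limit = self.current_limit()
        # Token-bucket refill, capped at one second's worth of tokens.
        self.allowance = min(limit, self.allowance + (now - self.last) * limit)
        self.last = now
        if self.allowance < 1.0:
            return False  # shed the request; the caller retries with backoff
        self.allowance -= 1.0
        return True
```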
Something that wasn't addressed -- there seems to be an architectural issue with ELB where ELBs with S3 access logs enabled had instances fail ELB health checks, presumably while the S3 API was returning 5XX. My load balancers in us-east-1 without access logs enabled were fine throughout this event. Has there been any word on this?
Really pleased to see this; it's good to see an organisation that's being transparent (and has maybe given us a little peek under the hood of how S3 is architected), and most importantly they seem quite humbled.<p>It would be easy for an arrogant organisation to fire or negatively impact the person that made the mistake. I hope Amazon don't fall into that trap and instead focus on learning from what happened, closing the book and moving on.
There are quite a few comments here ignoring the clarity that hindsight is giving them. Apparently the devops engineers commenting here have never fucked up.
This is a bit off topic. The use of the word "playbook" suggests to me that they use Ansible to help manage S3. I wonder if that is the case, or if it's just internal lingo that means "a script". Unless there is some other configuration management system that uses the word playbook that I'm not aware of.
What does everyone use S3 for?<p>I'm genuinely curious. As my experiments with it have left me disappointed with its performance, I'm just not sure what I could use it for. Store massive amounts of data that is infrequently accessed? Well, unfortunately the upload speed I got to the standard storage class was so abysmal it would take too much time to move the data there; and I suspect the inverse would be pretty bad as well.
> (...) [W]e have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.<p>All those tweets saying "turn it off and back on again"?<p>"We accidentally turned it off, but it hasn't been turned off for so long it took us hours to figure out how to turn it back on."<p>Poorly-presented jokes aside, this is rather concerning. The indexer and placement systems are SPOFs!! I mean, I'd <i>presume</i> these subsystems had ultra-low-latency hot failover, but this says <i>they never restarted</i>, and I wonder if AWS didn't simply invest a ton of magic pixie dust in making Absolutely Totally Sure™ the subsystems physically, literally never crashed in years. Impressive engineering, but also very scary.<p>At least they've restarted it now.<p>And I'm guessing the current hires now know a <i>lot</i> about the indexer and placer, which won't do any harm to the sharding effort (I presume this'll be sharded quick smart).<p>I wonder if all the approval guys just photocopied their signatures onto a run of blank forms, heheh.
I keep being reminded of something I read recently that made me feel uneasy about google's cloud spanner [1]:<p><i>the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.
It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.</i><p>But when it fails it's going to be epic!<p>[1] <a href="https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Spanner-and-the-CAP-Theorem.html" rel="nofollow">https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Sp...</a>
I am unpleasantly surprised that they do not mention why services that should be unrelated to S3 such as SES were impacted as well and what they are doing to reduce such dependencies.<p>From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.<p>The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?
It's curious they needed to "remove capacity" to cure a slow billing problem.<p>Is that code for a "did you try to reboot the system?" kind of troubleshooting?<p>It sounds to me like the authorized engineer sent a command to reboot/reimage a large swath of the S3 infrastructure.
If Amazon were a guy, he'd be a standup guy. This is a very detailed and responsible explanation. S3 has revolutionized my businesses and I love that service to no end. These problems happen very rarely, but I may add backups at some point just in case, using an nginx proxy approach; and because S3 is so good, everyone seems to adopt its API, so it's just a matter of a switch statement. Werner can sweat less. Props.<p>I would add, it would be awesome if there was a simulation environment, beyond just a test environment, that simulated outside servers making requests before a command was allowed to run on production, with something like a robot making that call; that could mitigate this, kind of like TDD on steroids, if they don't have that already.
Ops and Engineering here.<p>My guts hurt just reading this.<p>With big failures it's never just one thing. There is a series of mistakes, bad choices, and ignorance that leads to a big system-wide failure.
Twitter once had 2 hours of downtime because an operations engineer accidentally asked a tool to restart all memcached servers instead of a certain server. The tool was then changed to make sure that you couldn't restart more than a few servers without additional confirmation. Sounds very similar to this situation. Something to think about when you are building your tools to be more error proof.
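A sketch of that kind of guardrail; the threshold and confirmation step are illustrative, not how Twitter's actual tool works:

```python
BULK_THRESHOLD = 5  # restarting more than this many hosts needs confirmation

def restart_servers(hosts: list[str]) -> None:
    if len(hosts) > BULK_THRESHOLD:
        answer = input(
            f"You are about to restart {len(hosts)} servers. "
            "Type that number to confirm: "
        )
        if answer.strip() != str(len(hosts)):
            print("Confirmation failed; aborting.")
            return
    for host in hosts:
        print(f"restarting {host}")  # placeholder for the real restart call
```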
><i>Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.</i><p>We have geo-distributed systems. Load balancing and automatic failover. We agonize over edge cases that might cause issues. We build robust systems.<p>At the end of the day, reliability -- a lot like security -- is most affected by the human factor.
> Removing a significant portion of the capacity caused each of these systems to require a full restart.<p>I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
"we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."<p>Yeah... nothing says "resilience" quite like that...
It sounds like this can be mitigated by making sure everything is run in dry run mode first, and for something mission critical, getting it double-checked by someone before removing the dry run constraint.<p>It's good practice in general, and I'm kind of astonished it's not part of the operational procedures in AWS, as this would have quickly been caught and fixed before ever going out to production.
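A minimal sketch of the dry-run pattern, where the safe mode is the default and destructive mode has to be explicitly requested; the command and flag names are made up:

```python
import argparse

def decommission(hosts, dry_run=True):
    for host in hosts:
        if dry_run:
            print(f"[dry-run] would decommission {host}")
        else:
            print(f"decommissioning {host}")  # placeholder for the real call

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hosts", nargs="+")
    # Dry run is the default; the destructive path must be asked for explicitly.
    parser.add_argument("--no-dry-run", dest="dry_run", action="store_false")
    args = parser.parse_args()
    decommission(args.hosts, dry_run=args.dry_run)
```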
"As a result, (personal experience and anecdotal evidence suggest that) for complex continuously available systems, Operations Error tends to be the weakest link in the uptime chain."<p><a href="https://zvzzt.wordpress.com/2012/08/16/a-note-on-uptime/" rel="nofollow">https://zvzzt.wordpress.com/2012/08/16/a-note-on-uptime/</a>
I guess it is time to define commands whose inputs have a great distance in, say, the Damerau-Levenshtein metric.<p>For numerical inputs, one might use both the digits and the textual expression. This would make them quite cumbersome but much less prone to errors. Or devise some shorthand for them...<p>156 (on fi six). 35. (zer th fi). 170 (on se zer). 28 (two eig)
Evens have three letters, odds have two.<p>This is just my 2 cents.
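A sketch of the digits-plus-words idea from the comment above, where a numeric argument is only accepted when its spelled-out form is also supplied and matches, much like writing the amount on a check twice. The interface is hypothetical:

```python
# A fat-fingered "100" no longer matches the word form for "10", so the
# mistake is caught before anything destructive runs.
WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_digits(n: int) -> str:
    return " ".join(WORDS[d] for d in str(n))

def parse_count(digits: str, words: str) -> int:
    n = int(digits)
    if spell_digits(n) != words.strip().lower():
        raise ValueError(f"'{digits}' does not match '{words}'")
    return n

print(parse_count("10", "one zero"))    # ok -> 10
# parse_count("100", "one zero")        # raises: the typo is caught
```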
"People make mistakes all the time...the problem was that our systems that were designed to recognize and correct human error failed us." [1]<p>[1] <a href="http://articles.latimes.com/1999/oct/01/news/mn-17288" rel="nofollow">http://articles.latimes.com/1999/oct/01/news/mn-17288</a>
Amazon's AWS Outage Was More of a Design (UX) Failure and Less of a Human Error.
<a href="https://www.linkedin.com/pulse/how-small-typo-caused-massive-downtime-s3aws-hemant-kumar-singh" rel="nofollow">https://www.linkedin.com/pulse/how-small-typo-caused-massive...</a>
Wonder if numbers for critical command lines shouldn't be spelled out as well. If you think about how checks work, you're supposed to write the number in digits as well as in words.
-nbs two_hundreds instead of twenty
is much less likely to happen, just like rm -rf / should really be rm -rf `root`.
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.<p>This is the bit that'd worry me most; you'd think they'd be testing this.
This caused panic and chaos for a bit among my team, which I imagine was replicated across the web.<p>Moments like these always remind me that a particularly clever or nefarious set of individuals could shut down essential parts of the Internet with a few surgical incisions.
Seems like something like Chaos Monkey should have been able to predict and mitigate an issue like this. I'm actually curious if anyone uses it at all. Is anyone here at a large company (over 500 employees) that has it deployed?
I think they should have led with insensitivity about it and maybe a white lie. Such as... We took our main region us-east-1 down for X hours because we wanted to remind people they need to design for failure of a region :-)<p>Shameless plugs (authored months ago):
<a href="http://tuxlabs.com/?p=380" rel="nofollow">http://tuxlabs.com/?p=380</a> - How To: Maximize Availability Effeciently Using AWS Availability Zones ( note read it, its not just about AZ's it is very clear to state multi-regions and better yet multi-cloud segway...second article)
<a href="http://tuxlabs.com/?p=430" rel="nofollow">http://tuxlabs.com/?p=430</a> - AWS, Google Cloud, Azure and the singularity of the future Internet
This makes me want to write a program that would ask users to confirm commands if it thinks they are running a known playbook and deviating from it. Does anyone know if a tool like that exists?
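I'm not aware of an off-the-shelf tool for this, but a rough sketch of the idea is straightforward; the playbook commands and similarity threshold below are invented for illustration:

```python
import difflib

# Hypothetical playbook: the exact commands an operator is expected to run.
PLAYBOOK = [
    "s3-admin remove-capacity --subsystem billing --count 10",
    "s3-admin verify --subsystem billing",
]

def check_against_playbook(command: str, threshold: float = 0.9) -> None:
    """Ask for confirmation when a command is close to, but not exactly,
    a known playbook step (e.g. '--count 100' instead of '--count 10')."""
    best = max(PLAYBOOK,
               key=lambda step: difflib.SequenceMatcher(None, command, step).ratio())
    ratio = difflib.SequenceMatcher(None, command, best).ratio()
    if threshold <= ratio < 1.0:
        print(f"This looks like a deviation from the playbook step:\n  {best}")
        if input("Run it anyway? (yes/no): ").strip() != "yes":
            raise SystemExit("aborted")

check_against_playbook("s3-admin remove-capacity --subsystem billing --count 100")
```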
For as much as people jumped all over Gitlab last month, this seems remarkably similar in terms of preparedness for accidental and unanticipated failure.
No one on HN is questioning this - "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected." - they were debugging on the production system.
What most AWS customers don't realize is that AWS is poorly automated. Their reliability relies on exploiting the employees to manually operate the systems. The technical bar at Amazon is incredibly low and they can't retain any good engineers.
What's missing is addressing the problems with their status page system, and how we all had to use Hacker News and other sources to confirm that US East was borked.
For the many of us who have built businesses dependent on S3, is anyone else surprised at a few assumptions embedded here?<p>* "authorized S3 team member" -- how did this team member acquire these elevated privs?<p>* Running playbooks is done by one member without a second set of eyes or approval?<p>* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"<p>The good news:<p>* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."<p>The truly embarrassing part, which everyone has known about for years, is the status page:<p>* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "<p>When there is a wildly popular Chrome plugin to <i>fix</i> your page ("Real AWS Status"), you would think a company as responsive as AWS would have fixed this years ago.