> <i>At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.</i><p>It remains amazing to me that even with all the layers of automation, the root cause of most serious deployment problems remains some variant of a fat-fingered user.
This part is also interesting:<p><i>> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.</i><p>These sorts of things make me understand why the Netflix "Chaos Gorilla" style of operating is so important. As they say in this post:<p><i>> We build our systems with the assumption that things will occasionally fail</i><p>Failure at every level has to be simulated pretty often to understand how to handle it, and it is a really difficult problem to solve well.
> <i>From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.</i><p>Ensuring that your status dashboard doesn't depend on the thing it's monitoring is probably the first thing you should think about when designing your status system. This doesn't fill me with confidence about how the rest of the system is designed, frankly...
"I did."<p>That was CEO Robert Allen's response when the AT&T network collapsed [1] on January 15, 1990<p>He was asked who made the mistake.<p>I can't imagine any CEO now a days making a similar statement.<p>[1] <a href="http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse.html" rel="nofollow">http://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collaps...</a>
Not as interesting an explanation as I was hoping for. Someone accidentally typed "delete 100 nodes" instead of "delete 10 nodes" or something.<p>It sounds like the weakness in the process is that the tool they were using permitted destructive operations like that. The passage that stuck out to me: "in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."<p>At the organizational level, I guess it wasn't rated as all that likely that someone would try to remove capacity that would take a subsystem below its minimum. Building in a safeguard now makes sense as this new data point probably indicates that the likelihood of accidental deletion is higher than they had estimated.
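A minimal sketch of the kind of safeguard that passage describes: refuse a removal that would take a subsystem below its floor, and drain in small batches rather than all at once. The subsystem names, minimums, and batch size here are invented for illustration; this is not Amazon's actual tooling.

```python
# Hypothetical capacity-removal guard (illustrative names and numbers).
MIN_CAPACITY = {"index": 300, "placement": 150}  # assumed per-subsystem floors

def plan_capacity_removal(subsystem: str, current: int, to_remove: int,
                          max_batch: int = 5) -> list[int]:
    """Return the batch sizes to drain, or raise if the request is unsafe."""
    if current - to_remove < MIN_CAPACITY[subsystem]:
        raise ValueError(
            f"refusing: {subsystem} would drop below its minimum of "
            f"{MIN_CAPACITY[subsystem]} hosts"
        )
    # Remove capacity slowly, a few hosts at a time, instead of in one shot.
    batches, remaining = [], to_remove
    while remaining > 0:
        step = min(max_batch, remaining)
        batches.append(step)
        remaining -= step
    return batches
```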
Take a moment to look at the construction of this report.<p>There is no easily readable timeline. It is not discoverable from anywhere outside of social media or directly searching for it. As far as I know, customers were not emailed about this - I certainly wasn't.<p>You're an important business, AWS. Burying outage retrospectives and live service health data is what I expect from a much smaller shop, not the leader in cloud computing. We should all demand better.
> At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.<p>I find making errors on production when you think you're on staging is a big source of mistakes like this. One of the best things I ever did at one job was to change the deployment script so that when you deployed you would get a prompt saying "Are you sure you want to deploy to production? Type 'production' to confirm". This stopped several "oh my god, no!" situations where you repeated previous commands without thinking. For cases where you need to use SSH as well (best avoided but not always practical), it helps to use different colours, login banners and prompts for the terminals.
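The prompt described above is only a few lines; a rough sketch, with the environment name and wording as examples rather than any particular team's script:

```python
import sys

def confirm_environment(target: str) -> None:
    """Force the operator to type the environment name before a risky deploy."""
    if target != "production":
        return  # no extra friction for staging or dev
    answer = input("Are you sure you want to deploy to production? "
                   "Type 'production' to confirm: ")
    if answer.strip() != "production":
        print("Aborting deploy.")
        sys.exit(1)

confirm_environment("production")  # exits unless 'production' is typed back
```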
" we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected"<p>This is analogous to "we needed to fsck, and nobody realized how long that would take".
TLDR; Someone on the team ran a command by mistake that took everything down. Good, detailed description. It happens. Out of all of Amazon's offerings, I still love S3 the most.
"we have changed the SHD administration console to run across multiple AWS regions."<p>Dear Amazon: please lease a $25/month dedicated server to host your status page on.
AWS partitions its services into isolated regions. This is great for reducing blast radius. Unfortunately, us-east-1 has many times more load than any other region. This means that scaling problems hit us-east-1 before any other region, and affect the largest slice of customers.<p>The lesson is that partitioning your service into isolated regions is not enough. You need to partition your load evenly, too. I can think of several ways to accomplish this:<p>1. Adjust pricing to incentivize customers to move load away from overloaded regions. Amazon has historically done the opposite of this by offering cheaper prices in us-east-1.<p>2. Calculate a good default region for each customer and show that in all documentation, in the AWS console, and in code examples.<p>3. Provide tools to help customers choose the right region for their service. Example: <a href="http://www.cloudping.info/" rel="nofollow">http://www.cloudping.info/</a> (shameless plug).<p>4. Split the large regions into isolated partitions and allocate customers evenly across them. For example, split us-east-1 into 10 different isolated partitions. Each customer is assigned to a particular partition when they create their account. When they use services, they will use the instances of the services from their assigned partition.
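To illustrate option 4, a sketch of cell-style assignment; the partition count and naming scheme are made up for illustration, not how AWS actually allocates accounts:

```python
import hashlib

NUM_PARTITIONS = 10  # e.g. split us-east-1 into 10 isolated cells

def assign_partition(account_id: str) -> str:
    """Deterministically pin an account to one partition at signup."""
    digest = hashlib.sha256(account_id.encode()).hexdigest()
    cell = int(digest, 16) % NUM_PARTITIONS
    return f"us-east-1-cell-{cell}"

# Every request from the account is routed to its cell's service instances,
# so an operational mistake in one cell leaves the other nine untouched.
print(assign_partition("123456789012"))
```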
So this is the second high-profile outage in the last month caused by a simple command-line mistake.<p>> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.<p>If I had to guess which company could prevent mistakes like this from propagating, it would be AWS. It points to just how easy it is to make these errors. I am sure that the SRE who made this mistake is amazing and competent and just had one bad moment.<p>While I hope that AWS would be as understanding as Gitlab, I doubt the outcome will be the same.
tl;dr: Engineer fat-fingered a command and shut everything down. Booting it back up took a long time. Then the backlog was huge, so getting back to normal took even longer. We made the command safer, and are gonna make stuff boot faster. Finally, we couldn’t report any of this on the service status dashboard, because we’re idiots, and the dashboard runs on AWS.
Overall, it's pretty amazing that the recovery was as fast as it was. Given the throughput of S3 API calls you can imagine the kind of capacity that's needed to do a full stop followed by a full start. Cold-starting a service when it has heavy traffic immediately pouring into it can be a nightmare.<p>It'd be very interesting to know what kind of tech they use at AWS to throttle or do circuit breaking to allow back-end services like the indexer to come up in a manageable way.
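We don't know what AWS actually uses internally, but one common shape for this is an admission limit that ramps up slowly after a cold start. A generic sketch of that idea, not their implementation:

```python
import time

class RampUpThrottle:
    """Admit a slowly growing request rate after a cold start, so a recovering
    backend is not hit with its full steady-state traffic all at once."""

    def __init__(self, start_rps: float = 10.0, max_rps: float = 10000.0,
                 ramp_seconds: float = 600.0):
        self.start = time.monotonic()
        self.start_rps, self.max_rps = start_rps, max_rps
        self.ramp_seconds = ramp_seconds
        self.allowance = start_rps
        self.last = self.start

    def current_limit(self) -> float:
        # Linearly ramp from start_rps to max_rps over ramp_seconds.
        frac = min(1.0, (time.monotonic() - self.start) / self.ramp_seconds)
        return self.start_rps + frac * (self.max_rps - self.start_rps)

    def allow(self) -> bool:
        now = time.monotonic()
        limit = self.current_limit()
        # Token-bucket refill, capped at one second's worth of tokens.
        self.allowance = min(limit, self.allowance + (now - self.last) * limit)
        self.last = now
        if self.allowance < 1.0:
            return False  # shed the request; the caller retries with backoff
        self.allowance -= 1.0
        return True
```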
Something that wasn't addressed -- there seems to be an architectural issue with ELB where ELBs with S3 access logs enabled had instances fail ELB health checks, presumably while the S3 API was returning 5XX. My load balancers in us-east-1 without access logs enabled were fine throughout this event. Has there been any word on this?
Really pleased to see this; it's good to see an organisation that's being transparent (and has maybe given us a little peek under the hood of how S3 is architected), and most importantly they seem quite humbled.<p>It would be easy for an arrogant organisation to fire or negatively impact the person that made the mistake. I hope Amazon don't fall into that trap and instead focus on learning from what happened, closing the book and moving on.
There are quite a few comments here ignoring the clarity that hindsight is giving them. Apparently the devops engineers commenting here have never fucked up.
This is a bit off topic. The use of the word "playbook" suggests to me that they use Ansible to help manage S3. I wonder if that is the case, or if it's just internal lingo that means "a script". Unless there is some other configuration management system that uses the word playbook that I'm not aware of.
What does everyone use S3 for?<p>I'm genuinely curious. As my experiments with it have left me disappointed with its performance, I'm just not sure what I could use it for. Store massive amounts of data that is infrequently accessed? Well, unfortunately the upload speed I got to the standard storage class was so abysmal it would take too much time to move the data there; and I suspect the inverse would be pretty bad as well.
> (...) [W]e have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.<p>All those tweets saying "turn it off and back on again"?<p>"We accidentally turned it off, but it hasn't been turned off for so long it took us hours to figure out how to turn it back on."<p>Poorly-presented jokes aside, this is rather concerning. The indexer and placement systems are SPOFs!! I mean, I'd <i>presume</i> these subsystems had ultra-low-latency hot failover, but this says <i>they never restarted</i>, and I wonder if AWS didn't simply invest a ton of magic pixie dust in making Absolutely Totally Sure™ the subsystems physically, literally never crashed in years. Impressive engineering, but also very scary.<p>At least they've restarted it now.<p>And I'm guessing the current hires now know a <i>lot</i> about the indexer and placer, which won't do any harm to the sharding effort (I presume this'll be sharded quick smart).<p>I wonder if all the approval guys just photocopied their signatures onto a run of blank forms, heheh.
I keep being reminded of something I read recently that made me feel uneasy about google's cloud spanner [1]:<p><i>the most important one is that Spanner runs on Google’s private network. Unlike most wide-area networks, and especially the public internet, Google controls the entire network and thus can ensure redundancy of hardware and paths, and can also control upgrades and operations in general. Fibers will still be cut, and equipment will fail, but the overall system remains quite robust.
It also took years of operational improvements to get to this point. For much of the last decade, Google has improved its redundancy, its fault containment and, above all, its processes for evolution. We found that the network contributed less than 10% of Spanner’s already rare outages.</i><p>But when it fails it's going to be epic!<p>[1] <a href="https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Spanner-and-the-CAP-Theorem.html" rel="nofollow">https://cloudplatform.googleblog.com/2017/02/inside-Cloud-Sp...</a>
I am unpleasantly surprised that they do not mention why services that should be unrelated to S3 such as SES were impacted as well and what they are doing to reduce such dependencies.<p>From a software development perspective, it makes sense to reuse S3 and rely on it internally if you need object storage, but from an ops perspective, it means that S3 is now a single point of failure and that SES's reliability will always be capped by S3's reliability. From a customer perspective, the hard dependency between SES and S3 is not obvious and is disappointing.<p>The whole internet was talking about S3 when the AWS status dashboard did not show any outage, but very few people mentioned other services such as SES. Next time we encounter errors with SES, should we check for hints of S3 outage before everything else? Should we also check for EC2 outage?
It's curious they needed to "remove capacity" to cure a slow billing problem.<p>Is that code for a "did you try to reboot the system?" kind of troubleshooting?<p>It sounds to me like the authorized engineer sent a command to reboot/reimage a large swath of the S3 infrastructure.
If Amazon were a guy, he'd be a standup guy. This is a very detailed and responsible explanation. S3 has revolutionized my businesses and I love that service to no end. These problems happen very rarely, but I may add backups at some point just in case, using an nginx proxy approach; and because S3 is so good, everyone seems to adopt its API, so it's just a matter of a switch statement. Werner can sweat less. Props.<p>I would add, it would be awesome if there was a simulation environment, beyond just a test environment, that simulated outside servers making requests before a command was allowed to run on production, with something like a robot making that call; that could mitigate this, kind of like TDD on steroids, if they don't have that already.
Ops and Engineering here.<p>My guts hurt just reading this.<p>With big failures it's never just one thing. There is a series of mistakes, bad choices, and ignorance that leads to a big system-wide failure.
Twitter once had 2 hours of downtime because an operations engineer accidentally asked a tool to restart all memcached servers instead of a certain server. The tool was then changed to make sure that you couldn't restart more than a few servers without additional confirmation. Sounds very similar to this situation. Something to think about when you are building your tools to be more error proof.
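A sketch of that kind of guardrail; the threshold and confirmation step are illustrative, not how Twitter's actual tool works:

```python
BULK_THRESHOLD = 5  # restarting more than this many hosts needs confirmation

def restart_servers(hosts: list[str]) -> None:
    if len(hosts) > BULK_THRESHOLD:
        answer = input(
            f"You are about to restart {len(hosts)} servers. "
            "Type that number to confirm: "
        )
        if answer.strip() != str(len(hosts)):
            print("Confirmation failed; aborting.")
            return
    for host in hosts:
        print(f"restarting {host}")  # placeholder for the real restart call
```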
><i>Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.</i><p>We have geo-distributed systems. Load balancing and automatic failover. We agonize over edge cases that might cause issues. We build robust systems.<p>At the end of the day, reliability -- a lot like security -- is most affected by the human factor.
> Removing a significant portion of the capacity caused each of these systems to require a full restart.<p>I'd be interested to understand why a cold restart was needed in the first place. That seems like kind of a big deal. I can understand many reasons why it might be necessary, but that seems like one of the issues that's important to address.
"we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years."<p>Yeah... nothing says "resilience" quite like that...
It sounds like this can be mitigated by making sure everything is run in dry run mode first, and for something mission critical, getting it double-checked by someone before removing the dry run constraint.<p>It's good practice in general, and I'm kind of astonished it's not part of the operational procedures in AWS, as this would have quickly been caught and fixed before ever going out to production.
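A minimal sketch of the dry-run pattern, where the safe mode is the default and destructive mode has to be explicitly requested; the command and flag names are made up:

```python
import argparse

def decommission(hosts, dry_run=True):
    for host in hosts:
        if dry_run:
            print(f"[dry-run] would decommission {host}")
        else:
            print(f"decommissioning {host}")  # placeholder for the real call

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("hosts", nargs="+")
    # Dry run is the default; the destructive path must be asked for explicitly.
    parser.add_argument("--no-dry-run", dest="dry_run", action="store_false")
    args = parser.parse_args()
    decommission(args.hosts, dry_run=args.dry_run)
```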
"As a result, (personal experience and anecdotal evidence suggest that) for complex continuously available systems, Operations Error tends to be the weakest link in the uptime chain."<p><a href="https://zvzzt.wordpress.com/2012/08/16/a-note-on-uptime/" rel="nofollow">https://zvzzt.wordpress.com/2012/08/16/a-note-on-uptime/</a>
I guess it is time to define commands whose inputs have a great distance in, say, the Damerau-Levenshtein metric.<p>For numerical inputs, one might use both the digits and the textual expression. This would make them quite cumbersome but much less prone to errors. Or devise some shorthand for them...<p>156 (on fi six). 35. (zer th fi). 170 (on se zer). 28 (two eig)
Evens have three letters, odds have two.<p>This is just my 2 cents.
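A sketch of the digits-plus-words idea from the comment above, where a numeric argument is only accepted when its spelled-out form is also supplied and matches, much like writing the amount on a check twice. The interface is hypothetical:

```python
# A fat-fingered "100" no longer matches the word form for "10", so the
# mistake is caught before anything destructive runs.
WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_digits(n: int) -> str:
    return " ".join(WORDS[d] for d in str(n))

def parse_count(digits: str, words: str) -> int:
    n = int(digits)
    if spell_digits(n) != words.strip().lower():
        raise ValueError(f"'{digits}' does not match '{words}'")
    return n

print(parse_count("10", "one zero"))    # ok -> 10
# parse_count("100", "one zero")        # raises: the typo is caught
```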
"People make mistakes all the time...the problem was that our systems that were designed to recognize and correct human error failed us." [1]<p>[1] <a href="http://articles.latimes.com/1999/oct/01/news/mn-17288" rel="nofollow">http://articles.latimes.com/1999/oct/01/news/mn-17288</a>
Amazon's AWS Outage Was More of a Design (UX) Failure and Less of a Human Error.
<a href="https://www.linkedin.com/pulse/how-small-typo-caused-massive-downtime-s3aws-hemant-kumar-singh" rel="nofollow">https://www.linkedin.com/pulse/how-small-typo-caused-massive...</a>
Wonder if numbers for critical command lines shouldn't be spelled out as well. If you think about how checks work, you're supposed to write the number in digits as well as in words.
-nbs two_hundreds instead of twenty
is much less likely to happen, just like rm -rf / should really be rm -rf `root`.
> While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.<p>This is the bit that'd worry me most; you'd think they'd be testing this.
This caused panic and chaos for a bit among my team, which I imagine was replicated across the web.<p>Moments like these always remind me that a particularly clever or nefarious set of individuals could shut down essential parts of the Internet with a few surgical incisions.
Seems like something like Chaos Monkey should have been able to predict and mitigate an issue like this. I'm actually curious if anyone uses it at all. Is anyone here at a large company (over 500 employees) that has it deployed?
I think they should have led with insensitivity about it and maybe a white lie. Such as... We took our main region us-east-1 down for X hours because we wanted to remind people they need to design for failure of a region :-)<p>Shameless plugs (authored months ago):
<a href="http://tuxlabs.com/?p=380" rel="nofollow">http://tuxlabs.com/?p=380</a> - How To: Maximize Availability Effeciently Using AWS Availability Zones ( note read it, its not just about AZ's it is very clear to state multi-regions and better yet multi-cloud segway...second article)
<a href="http://tuxlabs.com/?p=430" rel="nofollow">http://tuxlabs.com/?p=430</a> - AWS, Google Cloud, Azure and the singularity of the future Internet
This makes me want to write a program that would ask users to confirm commands if it thinks they are running a known playbook and deviating from it. Does anyone know if a tool like that exists?
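I'm not aware of an off-the-shelf tool for this, but a rough sketch of the idea is straightforward; the playbook commands and similarity threshold below are invented for illustration:

```python
import difflib

# Hypothetical playbook: the exact commands an operator is expected to run.
PLAYBOOK = [
    "s3-admin remove-capacity --subsystem billing --count 10",
    "s3-admin verify --subsystem billing",
]

def check_against_playbook(command: str, threshold: float = 0.9) -> None:
    """Ask for confirmation when a command is close to, but not exactly,
    a known playbook step (e.g. '--count 100' instead of '--count 10')."""
    best = max(PLAYBOOK,
               key=lambda step: difflib.SequenceMatcher(None, command, step).ratio())
    ratio = difflib.SequenceMatcher(None, command, best).ratio()
    if threshold <= ratio < 1.0:
        print(f"This looks like a deviation from the playbook step:\n  {best}")
        if input("Run it anyway? (yes/no): ").strip() != "yes":
            raise SystemExit("aborted")

check_against_playbook("s3-admin remove-capacity --subsystem billing --count 100")
```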
For as much as people jumped all over Gitlab last month, this seems remarkably similar in terms of preparedness for accidental and unanticipated failure.
No one on HN is questioning this - "The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected." - they were debugging on the production system.
What most AWS customers don't realize is that AWS is poorly automated. Their reliability relies on exploiting the employees to manually operate the systems. The technical bar at Amazon is incredibly low and they can't retain any good engineers.
What's missing is addressing the problems with their status page system, and how we all had to use Hacker News and other sources to confirm that US East was borked.
For the many of us who have built businesses dependent on S3, is anyone else surprised at a few assumptions embedded here?<p>* "authorized S3 team member" -- how did this team member acquire these elevated privs?<p>* Running playbooks is done by one member without a second set of eyes or approval?<p>* "we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years"<p>The good news:<p>* "The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately."<p>The truly embarrassing part, which everyone has known about for years, is the status page:<p>* "we were unable to update the individual services’ status on the AWS Service Health Dashboard "<p>When there is a wildly popular Chrome plugin to <i>fix</i> your page ("Real AWS Status"), you would think a company as responsive as AWS would have fixed this years ago.