
I want to have an AWS region where everything breaks with high frequency

792 points by caiobegotti, almost 5 years ago

48 comments

jedberg, almost 5 years ago
For those saying "Chaos Engineering", first off, the poster is well aware of Chaos Engineering. He's an AWS Hero and the founder of Tarsnap.

Secondly, this would help make CE better. I actually asked Amazon for an API to do this ten years ago when I was working on Chaos Monkey.

I asked for an API to do a hard power off of an instance. To this day, you can only do a graceful power off. I want to know what happens when the instance just goes away.

I also asked for an API to slow down networking, set a random packet drop rate, EBS failures, etc. All of these things can be simulated with software, but it's still not exactly the same as when it happens outside the OS.

Basically I want an API where I can torture an EC2 instance to see what happens to it, for science!
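Some of the network-level torture described above can be approximated in software today. A minimal sketch, assuming a Linux EC2 instance, root access, and an interface named eth0 (adjust as needed); this drives the standard tc/netem tooling, not any AWS API:

```python
import subprocess

IFACE = "eth0"  # assumption: replace with the instance's real interface name

def degrade_network(latency_ms: int = 200, loss_pct: int = 5) -> None:
    """Add latency and random packet loss on IFACE using Linux tc/netem (needs root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem",
         "delay", f"{latency_ms}ms", "loss", f"{loss_pct}%"],
        check=True,
    )

def restore_network() -> None:
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=True)

if __name__ == "__main__":
    degrade_network()
    try:
        pass  # run the workload you want to observe under degraded networking
    finally:
        restore_network()
```

As the comment notes, this only simulates failures from inside the OS; a hard power-off or a hypervisor-level fault still has no public API equivalent.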
dijit, almost 5 years ago
Isn't us-east-1 exactly that?

All jokes aside, I actually asked my Google Cloud rep about stuff like this; they came back with some solutions, but often the problem is: what kind of failure condition are you hoping for?

Zonal outage (networking)? Hypervisor outage? Storage outage?

Unless it's something like S3 giving high error rates, most things can actually be done manually. (And this was the advice I got back, because faulting the entire set of APIs and tools in unique and interesting ways is quite impossible.)
davidrupp, almost 5 years ago
[Disclaimer: I work as a software engineer at Amazon (opinions my own, obvs)]

The chaos aspect of this would certainly increase the evolutionary pressure on your systems to get better. You would need really good visibility into what exactly was going on at the time your stuff fell over, so you could know what combination(s) to guard against next time. But there is definitely a class of problems this would help you discover and solve.

The problem with the testing aspect, though, is that test failures are most helpful when they're deterministic. If you could dictate the type, number, and sequence of specific failures, then write tests (and corresponding code) that make your system resilient to that combination, that would definitely be useful. It seems like "us-fail-1" would be more helpful for organic discovery of failure conditions, less so for testing specific conditions.
gregdoesit, almost 5 years ago
When I worked at Skype / Microsoft and Azure was quite young, the Data team next to me had a close relationship with one of the Azure groups who were building new data centers.

The Azure group would ask them to send large loads of data their way, so they could get some "real" load on the servers. There would be issues at the infra level, and the team had to detect them and respond. In return, the data team would also ask the Azure folks to just unplug a few machines - power them off, take out network cables - helping them test what happens.

Unfortunately, this was a one-off, and once the data center was stable, the team lost this kind of "insider" connection.

However, as a fun fact, at Skype we could use Azure for free for about a year - every dev in the office, for work purposes (including work pet projects). We spun up way too many instances during that time, as you'd expect, and only came around to turning them off when Azure changed billing to charge 10% of the "regular" pricing for internal customers.
bob1029, almost 5 years ago
It sounds to me like what some people want is a magical box they can throw their infrastructure into that will automatically shit-test everything that could potentially go wrong for them. This is poor engineering. Arbitrary, contrived error conditions do not constitute a rational test fixture. If you are not already aware of where failures might arise in your application and how to explicitly probe those areas, you are gambling at best. Not all errors are going to generate stack traces, and not all errors are going to be detectable by your users. What you would consider an error condition for one application may be a completely acceptable outcome for another.

This is the reliability engineering equivalent of building a data warehouse when you don't know what sorts of reports you want to run or how the data will generally be used after you collect it.
ben509, almost 5 years ago
I don't see a us-fail-1 region being set up, for a number of reasons.

One, this is not how AWS regions are designed to work. What they're thinking of is a virtual region with none of its own datacenters, but AWS has internal assumptions about what a region is that are baked into their codebase. I think it would be a massive undertaking to simulate a region like this.

(I don't think a fail AZ would work either; arguably it'd be worse, because all the code that automatically enumerates AZs would have to skip it, and that code is all over the place.)

Two, set up a region with deliberate problems, and idiots will run their production workload in it. It doesn't matter how many banners and disclaimers you put on the console, they'll click past them.

When customer support points out they shouldn't be doing this, the idiot screams at them: "But my whole business is down! You have to DO something!" This would be a small number of customers, but the support folks get all of them.

Three, AWS services depend on other AWS services. There are dozens of AWS services, each like a little company with varying levels of maturity. They ought to design all their stuff to gracefully respond to outages, but they have business priorities, and many services won't want to set up in us-fail-1. When a region adds special constraints, it has a high likelihood of becoming a neglected region like GovCloud.
falcolas, almost 5 years ago
I don't work with the group directly, but one group at our company has set up Gremlin, and the breadth and depth of outages Gremlin can cause is pretty impressive. Chaos testing FTW.
jiggawatts, almost 5 years ago
Along the same vein: instead of the typical "debug" and "release" configurations in compilers, I'd love it if there was also an "evil" configuration.

The evil configuration should randomise anything that isn't specified. No string comparison type selected? You get Turkish. All I/O and networking operations fail randomly. Any exception that can be thrown, is, at some small rate.

Or, to take things to the next level, I'd love it if every language had an interpreted mode similar to Rust's MIR interpreter. This would tag memory with types, validate alignment requirements, enforce the weakest memory model (e.g. ARM rules even when running on Intel), etc.
msla, almost 5 years ago
A zone not only of sight and sound, but of CPU faults and RAM errors, cache inconsistency and microcode bugs. A zone of the pit of prod's fears and the peak of test's paranoia. Look, up ahead: your root is now read-only and your page cache has been mapped to /dev/null! You're in the Unavailability Zone!
missosoup, almost 5 years ago
That region is called Microsoft Azure. It will even break the control UI with high frequency.
rob-olmos, almost 5 years ago
I imagine AWS and the other clouds have a staging/simulation environment for testing their own services. I seem to recall them discussing that for VPC during re:Invent or something.

I'm on the fence, though, about whether I'd want a separate region for this with various random failures. I think I'd be more interested in being able to inject faults/latencies/degradation in existing regions, when I want them to happen, for more control and the ability to verify any fixes.

It would be interesting to see how they price it as well: a high per-API cost depending on the service being affected, combined with a duration. E.g., make these EBS volumes 50% slower for the next 5 minutes.

Then, after or in tandem with the API pieces, release their own hosted Chaos Monkey type service.
bigiain, almost 5 years ago
Show HN! Introducing my new SPaaS:

Unreliability.io - Shitty Performance as a Service.

We hook your accounting software up to api.unreliability.io and when a client account becomes delinquent, our platform instantly migrates their entire stack into the us-fail-1 region. Automatically migrates back again within 10 working days after full payment has cleared - guaranteed downtime of no less than 4 hours during migration back to production region. Register now for a 30 day Free Trial!
kentlyons, almost 5 years ago
I want this at the programming-language level too. If a function call can fail, I want to set a flag and have it (randomly?) fail. I hacked my way around this by adding a wrapper that would return an error at random for a bunch of critical functions. It was great for working through a ton of race conditions in golang with channels, remote connections, etc. But hacking it in manually was annoying and not something I'd want to commit.
imhoguy, almost 5 years ago
Failing individual compute instances isn't hard; a chaos script that kills VMs is enough. Worst are the situations where things seem to be up but aren't acceptable: abnormal network latency, random packet drops, random but repeatable service errors, lagging eventual consistency. Not even mentioning any hardware woes.
vemv, almost 5 years ago
While these are not exclusive, personally I'd look instead into studying my system's reliability in a way that is independent of a cloud provider, or even of performing any side-effectful testing at all.

There's extensive research and prior work on all things resilience. One could say: if you build a system that is proven to be theoretically resilient, that model should extrapolate to real-world resilience.

This approach is probably intimately related to pure functional programming, which I feel has not been explored enough in this area.
terom, almost 5 years ago
There are multiple methods for automating AWS EC2 instance recovery for instances in the "system status check failed" or "scheduled for retirement" cases.

I've yet to figure out how to test any of those CloudWatch alerts/rules. I've had them deployed in my dev/test environments for months now, after having to manually deal with a handful of them in a short time period. They've yet to trigger once since.

Umbrellas when it's raining, etc.
swasheck, almost 5 years ago
Wait. I thought this was ap-southeast-2.
MattGaiser, almost 5 years ago
Whichever region Quora is using.
haecceity, almost 5 years ago
Why does Twitter often fail to load when I open a thread, and then work if I refresh? Does Twitter use us-fail-1?
georgewfraser, almost 5 years ago
I think people overestimate the importance of failures of the underlying cloud platform. One of the most surprising lessons of the last 5 years at my company has been how rarely single points of failure actually *fail*. A simple load-balanced group of EC2 instances, pointed at a single RDS Postgres database, is astonishingly reliable. If you get fancy and build a multi-master system, you can easily end up creating more downtime than you prevent when your own failover/recovery system runs amok.
kevindong, almost 5 years ago
At my job, my team owns a service that generally has great uptime. Dependent teams/services have gotten into the habit of assuming that our service will be 100% available, which is problematic because it's obviously not. That false assumption has unfortunately caused several minor incidents.

There has been some talk internally of doing chaos engineering to help improve the reliability of our company's products as a whole. Unfortunately, the most easily simulatable failure scenarios (e.g. entire containers going down at once) tend to be the least helpful, since my team designed the service to tolerate those kinds of easily modelable situations.

The more subtle/complex/interesting failure conditions are far harder to recognize and simulate (e.g. all containers hosted on one particular node experiencing 10s latencies on all network traffic, stale DNS entries, broken service discovery, etc.).
thethethethe, almost 5 years ago
You can just do this yourself. Google breaks its systems intentionally for a week every year; it's called DiRT week. DiRT takes weeks of planning before people even start debugging.

Doing this constantly for all products in a single region would be absolutely exhausting for SRE teams.

(Disclaimer: I work for ^GOOG and my opinions are my own.)
djhaskin987, almost 5 years ago
Friendly reminder that for any given single availability zone, the SLA that AWS provides is a single nine. That means they expect that availability zone to fail 10% of the time, or 6 minutes every hour. This very high failure rate comes absolutely for free, no need for a special region. Therefore, implementing a cross-availability-zone application that logs when packets are dropped should give you some idea of how your application handles failure.
gberger, almost 5 years ago
Relevant snippet in the Google SRE Book:

https://landing.google.com/sre/sre-book/chapters/service-level-objectives/#xref_risk-management_global-chubby-planned-outage

Google introduced exactly this to one of their internal services so that downstream dependencies can't rely on its extremely high availability.
t0mek, almost 5 years ago
Toxiproxy [1] is a tool that lets you create network tunnels with random network problems: high latency, packet drops or slicing, timeouts, etc.

Setting it up requires some effort (you can't just choose a region in your AWS config), but it's available now and can be integrated with tests.

[1] https://github.com/Shopify/toxiproxy
chirag64, almost 5 years ago
No one seems to be talking about the pricing aspect of this. Developers would want these us-fail-1 regions to be cheap or free, since they wouldn't be using them for production purposes. And before you know it, a lot of hobbyist developers would start using them as their production setup, since they wouldn't mind 1% downtime if they could pay less for it.
exabrial, almost 5 years ago
Simply host on Google Cloud! They will terminate your access for something random, like someone saying your name on YouTube while doing something bad. They don't have a number you can call, and their support is run by the stupidest of all AI algorithms.
raverbashing, almost 5 years ago
There's an easier way: spot instances (and us-east-1, as mentioned).

As for things like EBS failing or dropping packets, it's a bit tricky, as some things might break at the OS level.

And given sufficient failures, you can't swim anymore; you'll just sink.
castratikron, almost 5 years ago
Sometimes during development, instead of checking the return code I'll check rand() % 3 or something similar. I'll run through the code several times in a loop and exercise a lot of the failure modes very quickly this way.
exabrial, almost 5 years ago
Somewhat counter-intuitively: for small projects you want the hardware to be as resilient as possible; the larger you scale out, the less reliable you want the hardware to be, to force that resilience out of hardware and into software.
jonplackett, almost 5 years ago
This is such a clever idea. I wonder if Amazon is smart enough to actually do this.
foota, almost 5 years ago
Just deploy a new region with no ops support; it'll quickly become that.
thoraway1010, almost 5 years ago
A great idea! I'd love to run stuff in this zone. Rotate through a bunch of errors, unavailability, latency spikes, power outages, etc. every day; make it a 12-hour torture-test cycle.
martin-adams, almost 5 years ago
I can see a use case for this being implemented on top of Kubernetes. I've no idea whether that's achievable, but it could go some way toward making your code more resilient.
bootyfarm, almost 5 years ago
I believe this is available as a service called "SoftLayer".
chucky_z, almost 5 years ago
It's us-west-1! :D

We've had a ton of instances fail at once because of some kind of rack-level failure; a bunch of our EC2 instances ended up in the same rack. :(
SisypheanLife, almost 5 years ago
This would require AWS to invest in Chaos Monkeys.
jschulenklopper, almost 5 years ago
That region should have `us-wtf-1` as its code.
whoisjuan, almost 5 years ago
Isn't this what Gremlin does?
6510, almost 5 years ago
Sounds useful. Crank it up to a 99% failure rate and it becomes interesting science.
mamon, almost 5 years ago
Try us-east-1 :)
rdoherty, almost 5 years ago
This is called chaos engineering, and many companies have built tooling to do exactly this. Netflix pioneered/proselytized it years ago. Since you likely don't rely on just AWS services if your app is in AWS, you want something either on your servers themselves or built into whatever low-level HTTP wrapper you use. Use that library to do fault injection: high latency, errors, timeouts, etc.
lordgeek, almost 5 years ago
Brilliant!
fred_is_fred, almost 5 years ago
us-east-1?
CloudNetworking, almost 5 years ago
You can use IBM Cloud for that purpose.
jariel, almost 5 years ago
This is a really great idea.
code4tee, almost 5 years ago
This is what Chaos Monkey does.