Sounds like the AWS architecture caused Netflix to write better code (read: more durable, more fault-tolerant). Fewer assumptions baked into the code, and it will be easier to port to a new data center/cloud architecture if AWS doesn't meet their needs.

As Netflix continues to scale, these changes will make managing that growth much easier.

A lot of you seem to take this post as being negative toward the AWS architecture. I take it more as a good collection of common things you need to watch out for in distributed environments, specifically the dangers of assumptions within your current infrastructure that may change dramatically as you scale.
Their "Chaos Monkey" approach reminds me of an excellent paper on "Crash Only Software": <a href="http://goo.gl/dqDII" rel="nofollow">http://goo.gl/dqDII</a><p>The best way to test the uncommon case is to make it more common.
This reads like the 'fallacies of distributed computing' paper (http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing).

While the likelihood of failure (or added latency, upstream changes, etc.) is greater in a large-scale distributed environment you don't control than in your home-grown datacenter, those scenarios are just facts of life in any distributed environment.

An awesome side effect of hosting an app in a cloud environment is that you must face up to those fallacies immediately, or they'll eat you alive.
I want a Chaos Monkey, too!

Actually, that was my first reaction, but after thinking for a moment, it isn't really a reliable way to test. If you make changes to something, you don't know for sure whether the Chaos Monkey struck while you were testing a particular thing. Proper unit tests would seem to be a lot more useful.
Basically the gist is: you need to be prepared for anything to stop working at any time.

The tone of this post suggests to me that the criticism of AWS and the problems Netflix experienced are understated, which I can understand given their position as a flagship AWS customer.
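To put that gist into code: "prepared for anything to stop working" mostly means every remote call carries a deadline and a bounded retry, because the other end can vanish mid-request. A rough sketch (the function, attempt counts, and timings are all made up):

    # Hypothetical defensive remote call: a deadline on every network
    # operation plus bounded retries with exponential backoff.
    import time
    import urllib.request

    def fetch_with_retry(url, attempts=3, timeout=2.0, backoff=0.5):
        last_error = None
        for attempt in range(attempts):
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except OSError as e:   # urllib.error.URLError subclasses OSError
                last_error = e
                time.sleep(backoff * (2 ** attempt))   # back off and retry
        raise RuntimeError("%s unreachable after %d attempts"
                           % (url, attempts)) from last_error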
Hardware is always going to fail eventually. Moving to AWS caused Netflix to write better code to deal with those failures.

Failures were always going to happen, even in their own datacentre. What they have now is a more fault-tolerant system that should have less downtime overall.
If you do decide to adopt your very own pet Chaos Monkey in your next project, make sure you ARE able to gracefully degrade your service in case of failures. Otherwise your customers will see the monkey in action, manifested as "we'll be back shortly" messages. That's easier said than done, since much of the time we forget to write (or feel too lazy to write, or have no idea how to properly handle) the "else" branches for errors, unavailable services, and unreachable databases.

Otherwise, good idea. It forces you to think about the perils of a distributed environment from the very beginning, rather than leaving it as an afterthought.
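For instance (a hypothetical sketch; the service and cache objects are stand-ins): when a dependency dies, hand back a stale or generic answer instead of an error page.

    # Hypothetical graceful degradation: if the recommendations service is
    # down, serve the last cached list, or a generic default, not an error.
    def get_recommendations(user_id, service, cache):
        try:
            recs = service.fetch(user_id)   # may raise on timeout/outage
            cache[user_id] = recs           # keep the fallback copy fresh
            return recs
        except Exception:
            # The "else" branch everyone forgets: degrade, don't die.
            return cache.get(user_id, ["most-popular-1", "most-popular-2"])

The customer gets a slightly worse page instead of a "we'll be back shortly" one, and the monkey stays invisible.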
I love the idea of setting up a fully working system on AWS, then replaying all traffic from your live site over to it to see how it stands up under load.

No need to simulate traffic for testing purposes. Here's our *actual* traffic. All of it.

Nice.
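For anyone curious what that looks like in miniature, here's a toy sketch (entirely hypothetical; in practice the mirroring is usually done at the load balancer or proxy layer, and the hostnames below are made up). Each request is answered by the primary backend while a copy is fired at the shadow stack and its response thrown away:

    # Toy shadow-traffic mirror (hypothetical). Real users are served by the
    # primary; the shadow stack gets an identical copy of every request, and
    # its failures must never be allowed to affect the real response.
    import concurrent.futures
    import urllib.request

    PRIMARY = "http://primary.internal"     # illustrative hostnames
    SHADOW = "http://shadow-on-aws.internal"

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=32)

    def mirror(path):
        # Fire-and-forget: swallow every shadow-side error.
        try:
            urllib.request.urlopen(SHADOW + path, timeout=5).read()
        except Exception:
            pass

    def handle(path):
        pool.submit(mirror, path)                             # copy to shadow
        return urllib.request.urlopen(PRIMARY + path).read()  # real answer

The asymmetry is the whole trick: the shadow sees production-shaped load, but only the primary's answer ever reaches a customer.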
I'll bet other companies that use AWS/EC2 (e.g. Heroku, Dropbox) would have similar things to say.

As a guy with an IT background, I did have one question: they expected stability? Really? I always expect host/app/system failure, and am pleasantly surprised when it doesn't happen.