With Netflix announcing an upgrade [1] to Chaos Monkey today, I would be curious to know:

- Is your team using Chaos Monkey in your production/staging infrastructure?

- If not, do you use a variant tool or any interesting implementation of "Chaos Engineering" [2], and with what degree of success?

[1] https://news.ycombinator.com/item?id=12743693

[2] http://principlesofchaos.org

Previous discussion (2014): https://news.ycombinator.com/item?id=8713950
I commented about Failure Fridays at PagerDuty in the older thread (not Chaos Monkey precisely, but a similar concept). We still do that, with a few modifications:

1) ChatOps is used to execute the commands from Slack, which preserves history and helps interested parties follow along.

2) If we're not testing a specific service/AZ/region, we run a "reboot roulette" bot that reboots a random host from our production infrastructure. Every single production host is game.

3) This is now scheduled to run automatically at certain times of the week.

Many of the things we do regularly used to be terrifying. They no longer are, precisely because we do them regularly. That's the value of chaos engineering.
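Not their actual tooling, but a minimal sketch of how a "reboot roulette" bot could work, assuming a hypothetical list_production_hosts() inventory helper, an exclusion set for hosts already covered by a targeted test, and SSH access for the reboot:

    # Hypothetical sketch: inventory source, exclusions, and reboot mechanism
    # are assumptions, not PagerDuty's actual implementation.
    import random
    import subprocess

    EXCLUDED_HOSTS = set()  # hosts already covered by a targeted service/AZ/region test

    def list_production_hosts():
        """Return every production hostname from your inventory (assumed helper)."""
        # e.g. query your CMDB, cloud API, or service discovery here
        return ["app-01.prod", "app-02.prod", "db-01.prod", "cache-01.prod"]

    def reboot_roulette():
        """Pick one production host at random and reboot it over SSH."""
        candidates = [h for h in list_production_hosts() if h not in EXCLUDED_HOSTS]
        victim = random.choice(candidates)
        print(f"Rebooting {victim} -- every production host is game")
        subprocess.run(["ssh", victim, "sudo", "reboot"], check=False)

    if __name__ == "__main__":
        reboot_roulette()  # run on a weekly schedule via cron/CI, as described above

Invoking it from a ChatOps command rather than a terminal is what keeps the history in Slack for everyone to follow.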
I did something similar: a test suite across a CoreOS cluster that would fail machines in patterns for hours while checking service and data integrity. You can run the suite against the replica cluster whenever new deployment features are added or changed.

Doing it on live wasn't possible for me because failover resulted in some downtime, but I still found the approach very useful.
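A rough sketch of that kind of suite, under stated assumptions: the node names, health endpoint, and power-control helpers below are placeholders, and the data-integrity check is a stub you would replace with reads of known records:

    # Rough sketch of a failure-pattern suite run against a replica cluster.
    # Node names, the health endpoint, and the power-control helpers are
    # illustrative placeholders, not the original test suite.
    import subprocess
    import time
    import urllib.request

    NODES = ["core-1", "core-2", "core-3", "core-4", "core-5"]
    PATTERNS = [
        ["core-1"],              # single machine down
        ["core-2", "core-3"],    # two machines down at once
    ]
    HEALTH_URL = "http://cluster.example.internal/health"  # assumed health endpoint

    def stop_machine(host):
        """Placeholder: power the machine off (SSH here; IPMI/cloud API in practice)."""
        subprocess.run(["ssh", host, "sudo", "systemctl", "poweroff"], check=False)

    def start_machine(host):
        """Placeholder: power the machine back on via out-of-band management."""
        pass

    def service_healthy():
        """Service check: the cluster still answers its health endpoint."""
        try:
            return urllib.request.urlopen(HEALTH_URL, timeout=5).status == 200
        except OSError:
            return False

    def data_intact():
        """Stub for a data-integrity check: read back known records, compare checksums."""
        return True

    def run_suite(seconds_per_pattern=3600):
        for pattern in PATTERNS:
            for host in pattern:
                stop_machine(host)
            deadline = time.time() + seconds_per_pattern
            while time.time() < deadline:
                assert service_healthy(), f"service unhealthy with {pattern} down"
                assert data_intact(), f"data integrity lost with {pattern} down"
                time.sleep(30)
            for host in pattern:
                start_machine(host)  # restore the cluster before the next pattern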