With Netflix announcing an upgrade [1] to Chaos Monkey today, I would be curious to know:

- Is your team using Chaos Monkey in your production/staging infrastructure?

- If not, do you use a variant tool or any interesting implementation of "Chaos Engineering" [2], and with what degree of success?

[1] https://news.ycombinator.com/item?id=12743693

[2] http://principlesofchaos.org

Previous discussion (2014): https://news.ycombinator.com/item?id=8713950
I commented about Failure Fridays at PagerDuty in the older thread (not Chaos Monkey precisely, but a similar concept). We still do that, with a few modifications:

1) ChatOps is used to execute the commands from Slack, which preserves history and helps interested parties follow along.

2) If we're not testing a specific service/AZ/region, we run a "reboot roulette" bot that reboots a random host from our production infrastructure. Every single production host is game.

3) This is now scheduled to run automatically at certain times of the week.

Many of the things we do regularly used to be terrifying. They no longer are, precisely because we do them regularly. That's the value of chaos engineering.
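Not their actual tooling, but a minimal sketch of how a "reboot roulette" bot could work, assuming a hypothetical list_production_hosts() inventory helper, an exclusion set for hosts already covered by a targeted test, and SSH access for the reboot:

    # Hypothetical sketch: inventory source, exclusions, and reboot mechanism
    # are assumptions, not PagerDuty's actual implementation.
    import random
    import subprocess

    EXCLUDED_HOSTS = set()  # hosts already covered by a targeted service/AZ/region test

    def list_production_hosts():
        """Return every production hostname from your inventory (assumed helper)."""
        # e.g. query your CMDB, cloud API, or service discovery here
        return ["app-01.prod", "app-02.prod", "db-01.prod", "cache-01.prod"]

    def reboot_roulette():
        """Pick one production host at random and reboot it over SSH."""
        candidates = [h for h in list_production_hosts() if h not in EXCLUDED_HOSTS]
        victim = random.choice(candidates)
        print(f"Rebooting {victim} -- every production host is game")
        subprocess.run(["ssh", victim, "sudo", "reboot"], check=False)

    if __name__ == "__main__":
        reboot_roulette()  # run on a weekly schedule via cron/CI, as described above

Invoking it from a ChatOps command rather than a terminal is what keeps the history in Slack for everyone to follow.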
I did something similar: a test suite across a CoreOS cluster that would fail machines in patterns for hours while checking service and data integrity. You can run the suite against the replica cluster whenever new deployment features are added or changed.

Doing it on live wasn't possible for me because failover resulted in some downtime, but I still found the approach very useful.
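A rough sketch of that kind of suite, under stated assumptions: the node names, health endpoint, and power-control helpers below are placeholders, and the data-integrity check is a stub you would replace with reads of known records:

    # Rough sketch of a failure-pattern suite run against a replica cluster.
    # Node names, the health endpoint, and the power-control helpers are
    # illustrative placeholders, not the original test suite.
    import subprocess
    import time
    import urllib.request

    NODES = ["core-1", "core-2", "core-3", "core-4", "core-5"]
    PATTERNS = [
        ["core-1"],              # single machine down
        ["core-2", "core-3"],    # two machines down at once
    ]
    HEALTH_URL = "http://cluster.example.internal/health"  # assumed health endpoint

    def stop_machine(host):
        """Placeholder: power the machine off (SSH here; IPMI/cloud API in practice)."""
        subprocess.run(["ssh", host, "sudo", "systemctl", "poweroff"], check=False)

    def start_machine(host):
        """Placeholder: power the machine back on via out-of-band management."""
        pass

    def service_healthy():
        """Service check: the cluster still answers its health endpoint."""
        try:
            return urllib.request.urlopen(HEALTH_URL, timeout=5).status == 200
        except OSError:
            return False

    def data_intact():
        """Stub for a data-integrity check: read back known records, compare checksums."""
        return True

    def run_suite(seconds_per_pattern=3600):
        for pattern in PATTERNS:
            for host in pattern:
                stop_machine(host)
            deadline = time.time() + seconds_per_pattern
            while time.time() < deadline:
                assert service_healthy(), f"service unhealthy with {pattern} down"
                assert data_intact(), f"data integrity lost with {pattern} down"
                time.sleep(30)
            for host in pattern:
                start_machine(host)  # restore the cluster before the next pattern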