Ask HN: Who uses Chaos Monkey in production?

7 points by nbraga, over 8 years ago

With Netflix announcing an upgrade [1] to Chaos Monkey today, I would be curious to know:

- Is your team using Chaos Monkey in your production/staging infrastructure?

- If not, do you use a variant tool or any interesting implementation of "Chaos Engineering" [2], and to what degree of success?

[1] https://news.ycombinator.com/item?id=12743693

[2] http://principlesofchaos.org

Previous discussion (2014): https://news.ycombinator.com/item?id=8713950

2 comments

romanhn, over 8 years ago

I commented about Failure Fridays at PagerDuty in the older thread (not Chaos Monkey precisely, but similar concept). We still do that, with a few modifications:

1) ChatOps is used to execute the commands from Slack in order to preserve history and help interested parties follow along.

2) If we're not testing a specific service/AZ/region, a "reboot roulette" bot is run that reboots any host from our production infrastructure at random. Every single production host is game.

3) This is now scheduled to run automatically at certain times of the week.

Many of the things we do regularly used to be terrifying. They no longer are, precisely because we do them regularly. That's the value of chaos engineering.
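For readers who want a concrete picture of the "reboot roulette" idea, here is a minimal sketch. It is not PagerDuty's bot: the function name, the `reboot` callback, the `dry_run` flag, and the host names are all assumptions made up for illustration. The real reboot mechanism (SSH, a cloud API, a ChatOps command) is left to the caller.

```python
import random
from typing import Callable, Optional, Sequence


def reboot_roulette(
    hosts: Sequence[str],
    reboot: Callable[[str], None],
    rng: Optional[random.Random] = None,
    dry_run: bool = True,
) -> str:
    """Pick one host uniformly at random and reboot it.

    `reboot` is a hypothetical callback supplied by the caller.
    With dry_run=True (the default) the victim is only reported.
    """
    if not hosts:
        raise ValueError("no hosts to choose from")
    rng = rng or random.Random()
    victim = rng.choice(hosts)  # every host is equally likely to be picked
    if not dry_run:
        reboot(victim)
    return victim


if __name__ == "__main__":
    inventory = ["web-1", "web-2", "db-1"]  # hypothetical host names
    rebooted = []  # stands in for a real reboot action
    chosen = reboot_roulette(inventory, reboot=rebooted.append, dry_run=False)
    print(f"rebooted {chosen}")
```

Defaulting to a dry run and returning the chosen host makes it easy to post the choice to a chat channel before anything is actually rebooted.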
usgroup, over 8 years ago

I did something similar and had a test suite across a CoreOS cluster that would fail machines in patterns for hours whilst checking service and data integrity. One can run the suite against the replica cluster when new deployment features are added or changed.

Doing it on live wasn't possible for me because failover resulted in some downtime, although I consider the former to have been very useful.
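A rough sketch of what failing machines in patterns while checking integrity could look like. This is not the suite described above: `run_failure_patterns`, the `fail`/`recover`/`check` callbacks, and the simulated three-node cluster are assumptions for illustration only.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def run_failure_patterns(
    patterns: Iterable[Sequence[str]],
    fail: Callable[[str], None],
    recover: Callable[[str], None],
    check: Callable[[], bool],
) -> List[Tuple[Tuple[str, ...], bool]]:
    """Fail each group of machines, run the integrity check, then recover them.

    All three callbacks are hypothetical hooks supplied by the caller.
    """
    results = []
    for pattern in patterns:
        for machine in pattern:
            fail(machine)
        try:
            ok = check()  # service/data integrity while the machines are down
        finally:
            for machine in pattern:
                recover(machine)  # always restore before the next pattern
        results.append((tuple(pattern), ok))
    return results


if __name__ == "__main__":
    # Simulated 3-node cluster that needs at least 2 nodes up to stay healthy.
    nodes = ["n1", "n2", "n3"]
    down = set()
    patterns = [["n1"], ["n1", "n2"], ["n3"]]
    report = run_failure_patterns(
        patterns, down.add, down.discard, lambda: len(nodes) - len(down) >= 2
    )
    for pattern, ok in report:
        print(pattern, "ok" if ok else "FAILED")
```

Pointing the callbacks at a replica cluster rather than production matches the constraint mentioned above, where failover on live caused downtime.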