We built a system called "friendly fire" that nukes a server every 10 minutes. It has changed the mindset of all engineers and made our infrastructure missile-proof.<p>Funnily enough, it also improved our latencies a lot (which I guess is mostly due to memory leaks and the like).
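For the curious, the core of such a killer can be tiny. A minimal sketch, assuming targets come from a flat list and are reachable over SSH; the host names, SSH user, and reboot command are all hypothetical placeholders, not our production code:<p><pre><code>// Every 10 minutes, pick a random host from a hypothetical pool and
// force an ungraceful reboot so crash-recovery paths get exercised.
package main

import (
    "log"
    "math/rand"
    "os/exec"
    "time"
)

func main() {
    // Hypothetical candidate pool; a real setup would pull this from
    // service discovery and exclude anything not opted in.
    targets := []string{"app-01.internal", "app-02.internal", "app-03.internal"}

    for range time.Tick(10 * time.Minute) {
        victim := targets[rand.Intn(len(targets))]
        log.Printf("friendly fire: rebooting %s", victim)

        cmd := exec.Command("ssh", "chaos@"+victim, "sudo systemctl reboot --force")
        if err := cmd.Run(); err != nil {
            log.Printf("failed to reboot %s: %v", victim, err)
        }
    }
}
</code></pre>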
The following link shows how we do Chaos Engineering in TiDB, an open-source distributed database:<p><a href="https://www.pingcap.com/blog/chaos-practice-in-tidb/" rel="nofollow">https://www.pingcap.com/blog/chaos-practice-in-tidb/</a><p>The fault injection tools we are using:<p>- Kernel Fault Injection: the fault injection framework included in the Linux kernel, which you can use to implement simple fault injections to test device drivers.<p>- SystemTap: a scripting language and tool for diagnosing performance or functional problems.<p>- Fail: gofail for Go and fail-rs for Rust, for injecting failpoints into our own code (see the sketch below).<p>- Namazu: a programmable fuzzy scheduler for testing distributed systems.<p>We also built our own automatic chaos platform, Schrödinger, to automate all these tests and improve both efficiency and coverage.
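To make the failpoint idea behind gofail/fail-rs concrete, here is a minimal Go sketch. It illustrates only the concept (named hooks that are no-ops in production and can be armed in tests to force rare error paths); it is not the actual gofail API, which wires this into builds and tests differently, and the failpoint and function names are hypothetical:<p><pre><code>package main

import (
    "errors"
    "fmt"
    "sync"
)

// Registry of armed failpoints, keyed by name.
var (
    mu     sync.RWMutex
    points = map[string]error{}
)

// Enable arms a failpoint with the error it should inject.
func Enable(name string, err error) {
    mu.Lock()
    defer mu.Unlock()
    points[name] = err
}

// Inject returns the armed error, or nil when the failpoint is disabled.
func Inject(name string) error {
    mu.RLock()
    defer mu.RUnlock()
    return points[name]
}

// saveEntry is hypothetical application code with a failpoint on the
// path we want to test.
func saveEntry(data string) error {
    if err := Inject("raft-before-save"); err != nil {
        return err // simulated disk failure
    }
    fmt.Println("saved:", data)
    return nil
}

func main() {
    // In a test, arm the failpoint and verify the caller handles the error.
    Enable("raft-before-save", errors.New("injected: disk unavailable"))
    if err := saveEntry("entry-42"); err != nil {
        fmt.Println("got expected failure:", err)
    }
}
</code></pre>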
I have not used it, but I have heard it is a very useful tool: <a href="https://github.com/Netflix/chaosmonkey" rel="nofollow">https://github.com/Netflix/chaosmonkey</a>
Other useful materials:<p>- Chaos Monkey Guide for Engineers <a href="https://www.gremlin.com/chaos-monkey/" rel="nofollow">https://www.gremlin.com/chaos-monkey/</a><p>- Recent HN discussion on "Resilience Engineering: Where do I start?" <a href="https://news.ycombinator.com/item?id=19898645" rel="nofollow">https://news.ycombinator.com/item?id=19898645</a>
If you've never run a chaos experiment, how do you square blast radius with running in prod?<p>It seems like this setup works great if built in from the get-go, but incredibly painful and possibly dangerous if you're starting with existing applications.
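The guardrails I'd imagine you need for existing apps are explicit opt-in plus a hard cap on how much of any group you ever touch at once. A rough sketch of that selection logic in Go; the Instance type, tag, and numbers are all hypothetical, just to show the shape of it:<p><pre><code>package main

import (
    "fmt"
    "math/rand"
)

type Instance struct {
    ID      string
    Group   string
    ChaosOK bool // explicit opt-in, set per service as teams gain confidence
}

// pickVictims selects at most maxFraction of the opted-in instances in one
// group, so most of the pool always stays up.
func pickVictims(all []Instance, group string, maxFraction float64) []Instance {
    var eligible []Instance
    for _, in := range all {
        if in.Group == group && in.ChaosOK {
            eligible = append(eligible, in)
        }
    }
    n := int(float64(len(eligible)) * maxFraction)
    if n < 1 && len(eligible) > 1 {
        n = 1 // touch at least one instance, but never a lone survivor
    }
    rand.Shuffle(len(eligible), func(i, j int) {
        eligible[i], eligible[j] = eligible[j], eligible[i]
    })
    return eligible[:n]
}

func main() {
    fleet := []Instance{
        {"api-1", "api", true}, {"api-2", "api", true},
        {"api-3", "api", true}, {"api-4", "api", false}, // not opted in yet
    }
    for _, v := range pickVictims(fleet, "api", 0.25) {
        fmt.Println("would terminate:", v.ID)
    }
}
</code></pre>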
A thread from 2018: <a href="https://news.ycombinator.com/item?id=16244586" rel="nofollow">https://news.ycombinator.com/item?id=16244586</a>