I've read about companies having chaos monkeys to check that failed machines don't take your site down. When do they typically start doing that? It doesn't seem to make sense at our scale.
The 'size' at which you unleash the chaos monkey is, I think, largely subjective. The QoS <i>expected</i> is one factor. Can you <i>afford</i> to fail?<p>Formally, there are two 'requirements' to meet before you start randomly killing your servers:<p><pre><code> 1. You have highly available infrastructure.
 2. You have fail-over established across the architecture.</code></pre>
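<p>To make the two requirements concrete, here is a minimal sketch of a "check before you kill" guard. It is only an illustration, not how any real chaos-monkey tool works: the in-memory dict stands in for your fleet, and the names `preflight_ok`, `kill_random_instance`, and `min_healthy` are all made up for this example.

```python
import random


def preflight_ok(pool, min_healthy):
    """Return True if losing one instance still leaves `min_healthy` up.

    `pool` is a hypothetical dict mapping instance name -> is_healthy.
    """
    healthy = [name for name, up in pool.items() if up]
    return len(healthy) - 1 >= min_healthy


def kill_random_instance(pool, min_healthy, rng=random):
    """Mark one random healthy instance as down, but only if it is safe.

    Returns the name of the killed instance, or None if the redundancy
    check failed and nothing was touched.
    """
    if not preflight_ok(pool, min_healthy):
        return None
    victim = rng.choice(sorted(name for name, up in pool.items() if up))
    pool[victim] = False  # in real life: terminate the machine
    return victim


if __name__ == "__main__":
    fleet = {"web-1": True, "web-2": True, "web-3": True}
    print(kill_random_instance(fleet, min_healthy=2))  # kills one instance
    print(kill_random_instance(fleet, min_healthy=2))  # refuses: too few left
```

The point of the guard is the same as the list above: if you can't state how many instances you need to stay up (and have fail-over to get there), you aren't ready to kill any at random.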