I *think* the main thesis here is that chaos testing is the only way to detect 'unscalable error handling', but that most 'unscalable error handling' faults could be eliminated by testing for 'missing error handling' and 'unscalable infrastructure', which should be testable with less disruptive techniques than 'chaos'.

I'm not sure I follow the argument, though.

Just because you have demonstrated that a system is scalable, and that it is tolerant of errors, does not imply it is tolerant of errors at scale.

Take the Expedia error-handling example which, they claim, could have been verified without chaos testing:

> Expedia tested a simple fallback pattern where, when one dependent service is unavailable and returns an error, another service is contacted instead afterwards. There is no need to run this experiment in production by terminating servers in production: a simple test that mocks the response of the dependent service and returns a failure is sufficient.

When the first service becomes unavailable, does the alternate service have a cold cache? Does that drive increased timeouts and retries? Is there a hidden codependency of that service on the thing which caused the outage of the first service?

Maybe that can all be verified by independent, non-chaos scalability testing of that service.

But chaos testing is the integration test over the units that individual service load tests and mock-error tests have verified (a sketch of such a mock-error test is below). Sure, in theory this service fails over to calling a different dependency. And in theory that dependency is scalable.

Running a chaos test confirms that those assumptions are correct - that scalability + error tolerance actually delivers resilience.
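
A minimal sketch of the kind of mock-based fallback test the quoted passage describes, assuming hypothetical primary/secondary services and a hypothetical fetch_with_fallback helper (Python's unittest.mock):

```python
# Minimal sketch of a mock-error fallback test. Service names and the
# fetch_with_fallback helper are hypothetical, not Expedia's actual code.
from unittest import TestCase, main
from unittest.mock import Mock


class ServiceUnavailable(Exception):
    """Raised when a dependent service returns an error."""


def fetch_with_fallback(primary, secondary, request):
    """Call the primary service; if it fails, contact the secondary instead."""
    try:
        return primary.fetch(request)
    except ServiceUnavailable:
        return secondary.fetch(request)


class FallbackTest(TestCase):
    def test_falls_back_when_primary_errors(self):
        # Mock the dependent service so it returns a failure...
        primary = Mock()
        primary.fetch.side_effect = ServiceUnavailable("503")
        secondary = Mock()
        secondary.fetch.return_value = {"results": ["hotel-123"]}

        # ...and verify the fallback path is taken.
        response = fetch_with_fallback(primary, secondary, {"query": "hotels"})
        self.assertEqual(response, {"results": ["hotel-123"]})
        secondary.fetch.assert_called_once()

        # What this test cannot see: whether the secondary starts with a cold
        # cache, drives extra timeouts and retries, or shares a hidden
        # dependency with whatever took the primary down.


if __name__ == "__main__":
    main()
```

Which is the point: a test like this proves the fallback branch exists and is taken, but it says nothing about how the secondary behaves under the real traffic shift.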