(Full disclosure: I work at incident.io!)<p>We recently released our On-call product and, as part of that, had to think a lot about redundancy and 'failing safety'.<p>Here's how we achieve it, and how we're thinking about it. I'm interested in whether any other examples of this exist in the wild. I'd love to know more about how, e.g., Datadog achieve this.