(Full disclosure, I work at incident.io!)<p>We recently released our On-call product, and as part of that, had to think a lot about redundancy and 'failing safely'.<p>Here's how we achieve it, and how we're thinking about it. I'd be interested to hear of any other examples of this in the wild - I'd love to know more about how, e.g., Datadog achieves this.
Author here!<p>It’s a fun problem to solve, and one I’ve come across before when trying to alert on a monitoring tool being down, but it’s slightly different when it’s your own product.<p>Hopefully it’s interesting if you’ve hit similar puzzles before.