5 points by rorymalcolm 6 months ago

2 comments

(Full disclosure: I work at incident.io!)

We recently released our On-call product, and as part of that we had to think a lot about redundancy and 'failing safety'.

Here's how we achieve it, and how we're thinking about it. I'm interested in whether other examples of this exist in the wild, and I'd love to know more about how, e.g., Datadog achieves this.

lawrjone 6 months ago

Author here!

It's a fun problem to solve, and one I've come across before when trying to alert on a monitoring tool being down, but it's slightly different when the thing that might be down is your own product.

Hopefully it's interesting if you've hit similar puzzles before.

Who watches the watchers? How we page ourselves if incident.io goes down
