This is wonderful, especially the section on what didn't work ("Anti patterns," bottom of page).<p>This one in particular feels like good advice to startup founders as their company grows:<p>> Trying to take on multiple roles.
In past PagerDuty incidents, we've had instances where the Incident Commander has started to assume the Subject Matter Expert role and attempted to solve the problem themselves. This usually happens when the IC is an engineer in their day-to-day role. They are in an incident where the cause appears to be a system they know very well and have the requisite knowledge to fix. Wanting to solve the incident quickly, the IC will start to try and solve the problem. Sometimes you might get lucky and it will resolve the incident, but most of the time the immediately visible issue isn't necessarily the underlying cause of the incident. By the time that becomes apparent, you have an Incident Commander who is not paying attention to the other systems and is just focussed on the one problem in front of them. This effectively means there's no incident commander, as they would be busy trying to fix the problem. Inevitably, the problem turns out to be much bigger than anticipated and the response has become completely derailed.
That's interesting: it seems to assume there is no 24/7 on-site ops coverage to triage before escalating the call to the actual on-call staff.<p>This is from being the lead on call for the UK's Tymnet billing system; normally, if I got called, some initial triage
One thing I've been wondering about is how smaller organizations handle on call. Looking at the roles laid out [1] and assuming an on-call schedule of 4 weeks off, 1 week secondary, and 1 week primary, that's a team of at least 25 people.<p>For an organization of just a handful of engineers, how do they make on call work? A single on-call rotation would stretch the team to its limits, and it's likely that certain domains would only have a single Subject Matter Expert.<p>[1] <a href="https://response.pagerduty.com/before/different_roles/" rel="nofollow">https://response.pagerduty.com/before/different_roles/</a>
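A rough sketch of the arithmetic behind that estimate, in Python. The function names and the number of separately staffed rotations are assumptions for illustration; the comment above does not say how many roles it counted, and the sketch assumes nobody serves in more than one rotation.

```python
def people_per_rotation(weeks_off: int, weeks_secondary: int, weeks_primary: int) -> int:
    """People needed so that every week has exactly one primary and one secondary.

    Each person's personal cycle lasts (off + secondary + primary) weeks.
    Staggering people one week apart fills every slot, so the headcount
    for one rotation equals the cycle length in weeks.
    """
    return weeks_off + weeks_secondary + weeks_primary


def team_size(rotations: int, weeks_off: int = 4, weeks_secondary: int = 1,
              weeks_primary: int = 1) -> int:
    """Total headcount if each rotation is staffed by a separate group (assumption)."""
    return rotations * people_per_rotation(weeks_off, weeks_secondary, weeks_primary)


if __name__ == "__main__":
    # The rotation counts below are hypothetical, not taken from the PagerDuty docs.
    for rotations in (1, 2, 4):
        print(rotations, "rotation(s):", team_size(rotations), "people")
```

One rotation on this schedule needs 6 people, and four separately staffed rotations need 24, which is in the same ballpark as the figure quoted above. For a handful of engineers the only levers are a shorter cycle or the same people covering several roles.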
Let me guess... it includes getting paged as a very important step. j/k, this is very well written. Good for sharing with those who have not been on call at a good org before.
> "A guide to being on-call... we all have lives which might get in the way of on-call time"<p>How dare those engineers live a life. They're on-call goddammit.