Hi Folks,<p>Being on-call has been one of the most painful parts of my job as a software engineer these days. I have spent a lot of stressful weeks feeling completely demotivated by how much time goes into these issues, time that could have gone into innovation instead. I have ranked my top issues below. Do others feel the same pain? And why can't a solution be built for these?<p>1. The alert doesn't contain enough information for me to jump straight into resolution<p>2. It's not easy to find similar alerts triggered recently, so I can't go back and see how they were fixed<p>3. Runbooks are rarely useful because they are not kept up to date<p>4. I don't know whether any recently merged changes caused these alerts/incidents<p>5. A lot of the time, I don't know whom to reach out to when an alert comes from another team<p>6. I have to go to multiple systems to update statuses or notes<p>7. I have to summarize all the details again in an on-call handoff doc at the end of the rotation
I've experienced all these problems. There are solutions trying to address them, e.g. <a href="https://incident.io/" rel="nofollow noreferrer">https://incident.io/</a> (which I'm not affiliated with in any way). It's not easy though. I think they all stem from the same root cause: teams not investing enough in making oncall processes and tooling good, and in particular not keeping things up to date. As you say, runbooks are often outdated. The same happens with the lists mapping components to owning teams.<p>There's another problem (#8 to add to the list) I've also felt pain from: how you're scheduled to work oncall. We had ad-hoc manual scheduling of who would be oncall when. A tool for solving that is <a href="https://oncallscheduler.com" rel="nofollow noreferrer">https://oncallscheduler.com</a> (which I am affiliated with). It automates oncall scheduling while keeping it fair and predictable, and gives all engineers self-service control over when and how they're scheduled. I'd love some feedback on it.
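For what it's worth, the core of fair scheduling is simple to sketch. Below is a minimal, hypothetical round-robin assigner in Python that skips weeks an engineer has marked unavailable and always picks the least-loaded person; it illustrates the idea only and is not how oncallscheduler.com actually works (the names and constraints are made up).

    from datetime import date, timedelta

    def build_schedule(engineers, unavailable, start, weeks):
        """Assign one engineer per week, skipping unavailable weeks
        and always picking whoever has the fewest shifts so far."""
        shifts = {e: 0 for e in engineers}
        schedule = []
        for i in range(weeks):
            week_start = start + timedelta(weeks=i)
            # candidates who haven't blocked this week, least-loaded first
            candidates = sorted(
                (e for e in engineers if week_start not in unavailable.get(e, set())),
                key=lambda e: shifts[e],
            )
            if not candidates:
                schedule.append((week_start, None))  # nobody available: flag for manual fix
                continue
            chosen = candidates[0]
            shifts[chosen] += 1
            schedule.append((week_start, chosen))
        return schedule

    # example usage with made-up engineers and one blocked week
    rota = build_schedule(
        engineers=["ana", "bo", "chen"],
        unavailable={"bo": {date(2024, 7, 1)}},
        start=date(2024, 6, 24),
        weeks=6,
    )
    for week_start, engineer in rota:
        print(week_start, engineer)

The real problem is of course harder (swaps, time zones, fairness over holidays), which is exactly why ad-hoc manual scheduling breaks down.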
I previously worked at a popular startup, on a team that owned many business-critical services. Our on-call was brutal, and I have first-hand experience with most of the problems you're talking about.<p>Interestingly, we solved these problems back then with an internal tool. Here is how it worked:<p>1. It integrated with all the internal tools (task management, alert management, monitoring systems, and PagerDuty) and offered one central dashboard where it all came together.<p>2. Each team in PagerDuty could see the details of every alert that fired in a given shift/rotation, so anyone could go there at any time to see which alerts fired, when, and to whom.<p>3. Each alert could be tagged (noisy, non-actionable, etc.), and a note and follow-up task links could be attached.<p>The tool properly solved several of the problems you mention. For example, you don't need to write a summary document: everything is captured there and can be reviewed quickly in the handoff meeting. With tags, you can easily find bad alerts or alerts with outdated runbooks. It's also easy to hold team members accountable if they aren't following the process/best practices.<p>Oncall is a heavy process, but IMO a lot of these problems can be solved properly with the right tooling.<p>PS - I didn't create the tool, but I used it extensively to get my team's oncall under control.
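To make the tagging and handoff idea concrete, here is a rough, hypothetical sketch of the kind of alert record such a tool could be built around. This is not the internal tool described above, just an illustration of how tags, notes, and follow-up links on each alert let the handoff summary fall out of the same data instead of being written by hand.

    from dataclasses import dataclass, field

    @dataclass
    class AlertRecord:
        """One alert as tracked across a shift."""
        title: str
        fired_at: str          # ISO timestamp from the monitoring system
        assignee: str          # who was paged
        tags: set = field(default_factory=set)            # e.g. {"noisy", "non-actionable"}
        note: str = ""         # what was done / what to watch for
        follow_ups: list = field(default_factory=list)    # links to follow-up tasks

    def handoff_summary(alerts):
        """Build the handoff view instead of writing a summary doc by hand."""
        lines = []
        for a in sorted(alerts, key=lambda a: a.fired_at):
            tags = ",".join(sorted(a.tags)) or "untagged"
            lines.append(f"{a.fired_at} {a.title} [{tags}] -> {a.assignee}: {a.note}")
            lines.extend(f"  follow-up: {link}" for link in a.follow_ups)
        return "\n".join(lines)

    # made-up example data
    alerts = [
        AlertRecord("checkout latency p99 > 2s", "2024-06-25T03:12", "ana",
                    tags={"actionable"}, note="rolled back bad deploy",
                    follow_ups=["JIRA-123"]),
        AlertRecord("disk 85% on db-replica-2", "2024-06-26T14:40", "ana",
                    tags={"noisy"}, note="threshold too low, tune alert"),
    ]
    print(handoff_summary(alerts))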
Some of these need to be fixed at a higher level.<p>On-call monitoring responsibilities for a given time period should be separate from resolution duties.<p>In other words, aside from some well-defined ops issues that have clear runbooks, the role of the person monitoring should be to find out, or already know, who to escalate to, not to resolve the issue themselves.<p>It's actually a great onboarding activity, as it exposes new staff members to parts of the infra and operations that their managers/peers might have neglected to mention.<p>The second way to alleviate the issues is to pair a person such as yourself with someone who has a lot of institutional knowledge, so that you can triage together, learn from them, and update the docs so the organisation as a whole has better resources. Eventually, the percentage of incidents where you lack the institutional knowledge to proceed will decrease to the point where it's mostly safe for you to do on-call on your own.<p>Then eventually you become the experienced on-call person who gets paired with the new employees who are building that institutional knowledge.
I'm actually building exactly this: a simple on-call and incident management platform that addresses many of these frustrations, since I've had the same ones. I'd love to talk to you about it and get your feedback on my progress so far if you're interested.
I worked for a systems integration and management firm for five years. We avoided the sorts of problems you describe because management gave us the best tools and training for our work and, in return, demanded exacting documentation that had to be kept up to date as part of the job. Logs and alerts were refined to eliminate confusion. We were tasked with implementing scripts to correct problems or mitigate their effects.<p>Being on-call is a challenge, but also an opportunity to improve processes. Your management should empower the team to fix the processes.
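As a hypothetical example of the kind of mitigation script that approach produces: a small check that compresses old logs when disk usage crosses a threshold, so the pager never needs to fire for it. This is only an illustrative sketch (the path and thresholds are made up), not one of that firm's actual scripts.

    import gzip
    import shutil
    import time
    from pathlib import Path

    LOG_DIR = Path("/var/log/myapp")   # hypothetical service log directory
    THRESHOLD = 0.85                   # act when the disk is 85% full
    MAX_AGE_DAYS = 7                   # only touch logs older than a week

    def disk_usage_fraction(path):
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def compress_old_logs():
        cutoff = time.time() - MAX_AGE_DAYS * 86400
        for log in LOG_DIR.glob("*.log"):
            if log.stat().st_mtime < cutoff:
                with open(log, "rb") as src, gzip.open(f"{log}.gz", "wb") as dst:
                    shutil.copyfileobj(src, dst)
                log.unlink()   # remove the original only after the gzip succeeded
                print(f"compressed {log}")

    if __name__ == "__main__":
        if disk_usage_fraction(LOG_DIR) > THRESHOLD:
            compress_old_logs()
        else:
            print("disk usage below threshold, nothing to do")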
I work at <a href="https://incident.io/" rel="nofollow noreferrer">https://incident.io/</a> where we build a product that aims to help with this.<p>I'll call out some of the points you make that we can help with:<p>1. Our incidents are created automatically from triggers like PagerDuty incidents/OpsGenie alerts/etc, and we pull any information we find into the Slack channel and make it easily available (pinning it to the channel, setting it automatically as a channel bookmark, etc). That tends to help when you jump into the incident fresh, because everything is easily available.<p>2. We don't do much matching against previous incidents (yet), but it's easy to search for similar incidents in our dashboard. Unlike alerts, incidents have a history of updates and curated detail about how they were resolved, so a history of similar incidents is genuinely useful when you're facing a similar problem.<p>5. We have an in-product catalog where you can store features, services, etc and who owns what. Most customers ask responders 'what is affected?' and have us automatically page, or surface, whoever owns that feature, which really helps speed up response. Some of our customers have 5k+ services; there's no way humans can remember who owns what at that scale.<p>6. This is our bread and butter: we plug into everything (status pages, including a native status page we offer ourselves, Jira, GitHub, whatever) to make sure incident updates are pushed everywhere. The idea is that responders update the incident and we share it everywhere it needs to go, instead of asking people to remember while they're busy responding.<p>7. Our incidents help massively with this. Provided responders are pushing updates to their incidents, an on-call handover turns into a super-quick review of the incidents in our dashboard and of the updates/outstanding actions.<p>tl;dr: a lot of what you've described can be fixed or helped massively by good tooling. Even runbooks being out of date is improved by tools that connect people to runbooks more often: if runbooks are more reliably useful, people are more incentivised to keep them updated.<p>This won't solve everything, but if you're on the lookout for solutions you should definitely check out incident.io and similar tools.
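For readers curious what the "update once, push everywhere" idea in point 6 looks like mechanically, here is a toy sketch. It is emphatically not incident.io's implementation or API, just the general shape: one responder update fanned out to per-system payloads, with placeholder webhook URLs and a dry-run default so it runs without credentials.

    import json
    import urllib.request

    # placeholder endpoints; real integrations would use each system's API and auth
    SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"
    STATUS_PAGE_WEBHOOK = "https://example.com/status-page/update"

    def post_json(url, payload, dry_run=True):
        body = json.dumps(payload).encode()
        if dry_run:                       # keep the sketch runnable without real credentials
            print(f"POST {url}: {body.decode()}")
            return
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

    def broadcast_update(incident_id, status, message, dry_run=True):
        """Responder writes one update; every downstream system gets its own payload."""
        post_json(SLACK_WEBHOOK,
                  {"text": f"[{incident_id}] {status}: {message}"}, dry_run)
        post_json(STATUS_PAGE_WEBHOOK,
                  {"incident": incident_id, "status": status, "body": message}, dry_run)

    broadcast_update("INC-42", "monitoring",
                     "Rolled back the bad deploy, error rate recovering")

The point of tooling like this is that the list of destinations lives in one place, so responders stop doing the fan-out in their heads while they're firefighting.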
I think the biggest hassle during on-call is that a lot of things have no clear ownership, so I don't know whom to throw the hot potato to. We are switching to a better solution with clear ownership, so hopefully it helps.