
Ask HN: What is your company's on-call rotation like?

82 points by bobzimuta over 9 years ago
Also looking for readings on best practices.

Curious about your company's on-call for details about:

* expected duties (only answer on-call, do other work, etc)

* how deep does your on-call dive into code to fix, vs triaging, patching, and creating follow-up work to fix permanently?

* priority order of work (on-call tickets vs regular work vs existing on-call tickets vs special queue of simple tasks)

* what happens when the current on-call can't complete in time?

* how do you manage for other teams' risk? (i.e. their API goes down, you can't satisfy your customers)

* any other tidbits

25 comments

jeletonskelly over 9 years ago
To software developers in this thread who are on-call: I'd like to share some thoughts with you. I've worked at places that have on-call rotations and others that have none. I will no longer work at a company that requires me to be on-call.

Why? It says a lot when a company doesn't put the effort into various forms of testing and QA to ensure that production software does not have critical issues that warrant a 2am call. Unit, functional, integration, load, and simulation tests should be written for every single piece of critical infrastructure. You should be hammering these things in staging environments with 10-100x of your normal peak load. Use something like Gore to replay live traffic against a version in staging or QA environments. Yes, that takes work, but to me it's better upfront than waking me up in the middle of the night, or knowing that when I go home I have to have my phone around me at all times. The business should care about these things too; it's their product and they should care enough about you to make sure good processes are in place to ensure quality production software.

That said, when I was at non-on-call companies there were definitely times when something happened that warranted immediate attention. Generally someone in operations would get the first call; they check logs, diagnose the issue, and call a developer familiar with the app that's causing the issues if that's the case. I don't mind waking up then, because I know it has to be something serious that slipped past our processes.
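The traffic-replay idea mentioned above can be approximated even without a dedicated tool. Below is a minimal Python sketch, assuming a hypothetical log of recorded GET paths ("recorded_requests.log") and a hypothetical staging host ("staging.example.com"); it is a toy illustration of the replay-against-staging idea, not GoReplay itself.

    import concurrent.futures
    import urllib.request

    # Assumed inputs: a log of recorded GET paths and a staging base URL (both hypothetical).
    STAGING_BASE = "https://staging.example.com"
    RECORDED_PATHS = "recorded_requests.log"   # one path per line, e.g. "/api/items?page=2"
    REPLAY_FACTOR = 10                         # replay each request 10x to approximate extra load

    def replay(path):
        """Send one recorded GET request to staging; return the HTTP status (599 on failure)."""
        try:
            with urllib.request.urlopen(STAGING_BASE + path, timeout=10) as resp:
                return resp.status
        except Exception:
            return 599

    def main():
        with open(RECORDED_PATHS) as f:
            paths = [line.strip() for line in f if line.strip()]
        # Fan requests out with a thread pool to approximate concurrent production load.
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
            statuses = list(pool.map(replay, paths * REPLAY_FACTOR))
        errors = sum(1 for s in statuses if s >= 500)
        print(f"replayed {len(statuses)} requests, {errors} errors")

    if __name__ == "__main__":
        main()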
biot over 9 years ago
For those doing on-call and required to be responsive to alerts at all times during your shift, one thing to think about is your pay for this burden. Some cardiologists get over $3K/day just to be on-call (whether or not anything comes up), which means keeping within range of the hospital, being sober, not being more than a few minutes away from their phone, and so on. While generally nobody's life is on the line should a server go down, if you're expected to go beyond an "if I hear my phone alert and I'm sober and am near my laptop" level of effort, ensure you are compensated appropriately.
gnur over 9 years ago
Current rotation for me is: 1 out of 3 weeks, from 7 am till 12 midnight, as a systems administrator. Our SLA to customers is non-existent, so on-call duties are best-effort (if you are in the shower and it takes 30 minutes to respond, no real problem). Furthermore, this also means that it is up to the person on call to determine what needs to be done.

Outside of regular work hours you are only expected to work when an alert / call comes in. If at all possible you should fix it; if it would take > 1 hour you start looking at temporary workarounds that will hold until you are back at the office. During regular work hours on-call tickets always get priority. When problems arise with external dependencies we open tickets in their support system and wait for their response. Most important in our case is to keep the support desk up to date so customers don't open too many tickets, and to let people know we are working on it.

Depth of on-call work depends. It usually consists of rebooting / restarting services, checking if they come up as expected, and trying to find the root cause of the problems. If the root cause can be easily identified and fixed, fix it immediately. Otherwise, open a ticket to fix it asap.
dnnrly over 9 years ago
I started being on-call a few months ago. For us it works like this:

- 1 week in 4, Monday 00:00 to Sunday 23:59, with a fixed pay adjustment (the same fixed amount added to our salary every month, not a % increment).

- If you get a call, you're on the clock and get paid the regular over-time rate. Theoretically this only applies if you get called and are doing something for more than 30 minutes, though.

- I know most of the on-call managers personally, so I don't mind picking up the phone for a quick opinion for free, if it's at a reasonable time.

- Online within an hour once we've picked up the phone.

- The senior on-call manager gets woken up by Ops first and then decides which parts of the system are affected. After going through known issues and workarounds, they work out which part of the system needs to be looked at and make the call to the appropriate on-call developer (i.e. which team).

- Duty is primarily to diagnose and provide knowledge to the on-call manager to make a decision so service can come back up again.

- Don't do code fixes (J2EE on my team, mix of Java and C++ on others) - we have far too much to go wrong for hacking around like that, but we may do config changes and some scripting. The line can get a bit blurred.

- In our team we should raise a ticket/Jira every time we get a call so that it's visible to everyone what happened and we can see where our pain points are.

- During office hours, production issues are distributed in the team in the same way as bugs, according to domain knowledge and individual capacity.

- We tend not to get called out except when we do installations into production. We have a fairly heavyweight process for releases - which we are working on.

- I am at least 4 levels of communication away from any customers, including 1st and 2nd line teams.

Edit: formatting
rodgerd over 9 years ago
On-call is paid at a weekly rate for being the on-call person. There's an additional bump if you're actually called out, pro-rata on an hourly basis to your salary.

You are required to respond to calls within the SLA, and work on problems until they're fixed (the definition of fixed being service restoration - root cause analysis for complex problems, and hence permanent fixes, will happen during normal hours) or until you've been working long enough that you need to be spelled.

If you're on-call you're still expected to be doing your day job, but production takes priority, and projects are expected to accommodate that.

On-call generally rotates around teams, so one week in 4-8 depending on team size.
notmeknees over 9 years ago
Funny that I decided to finally create an account to answer this.

Our team is heavily silo'ed, and each developer is responsible for a specific set of code. Since we have gone through a bunch of reductions and reorganizations, this ends up meaning that often one person is responsible. Our customer is a 24/7/365 operational unit, and it is our only customer.

So the end result is that any given developer is on call all the time, with no pay compensation for it at all. Not only that, but you should expect to be called about not only defects, but also cases where the customer is trying to use our software in a way it was never designed to be used. So: get a call at 2 AM on the weekend, then design and implement a new feature and release it to production to debug and test.

I really dislike the way we operate, and recently the entire QA team was let go, so I have a feeling things are only going to get worse.
scardine over 9 years ago
In Brazil, labor relations are heavily regulated. For each hour you are on-call you make 1/3 of a regular hour of work, and when you answer a call and work on weekends or at night you are paid twice the regular hourly wage.

You are expected to answer the phone promptly, be sober, and do anything to remedy the problem, including reaching out to other people who can fix it (it does not have to be a permanent solution).
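As a rough worked example of that rule, here is a minimal Python sketch with an assumed hourly wage; the real calculation depends on the contract and the applicable labor-law provisions.

    # Sketch of the pay rule described above: each standby (on-call) hour pays 1/3 of the
    # regular hourly wage, and hours actually worked at night or on weekends pay 2x.
    # The wage and hour figures are made-up illustration values.

    REGULAR_HOURLY_WAGE = 30.00   # assumed wage, in BRL/hour

    def oncall_pay(standby_hours, worked_hours):
        standby_pay = standby_hours * REGULAR_HOURLY_WAGE / 3
        callout_pay = worked_hours * REGULAR_HOURLY_WAGE * 2
        return standby_pay + callout_pay

    # Example week: 60 hours carrying the phone, 3 hours actually working incidents at night.
    print(oncall_pay(standby_hours=60, worked_hours=3))  # 60*10 + 3*60 = 780.0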
throwawayoncall over 9 years ago
Work at a well funded Series B startup.

On-call is unpaid, with no additional bump if called out or on public holidays. No additional vacation in lieu either :(

- Expected duties are to fix the issue and bring production systems back up; all engineers have context on production systems and can escalate if any additional context is needed. We have an established workflow of steps to take to determine root cause. We can also directly ping all engineers to provide additional help should there be a critical issue with a system you are unfamiliar with.

- If you are on-call, the on-call duties come first and foremost, above everything else. Expected response time is 10-15 minutes from the issue being raised, at all times.

- Mitigating risk is via communication and getting enough context to know the bad outcomes and how to mitigate any issues.

- On-call rotation is very arbitrary and undefined (we have five engineers who can do on-call duty); we try to share the duties equally but some get assigned way more on-call than others. (I was on-call most weekends the last couple of months and all throughout the Xmas period, for example.)
cookiecaper over 9 years ago
Everyone takes a turn being on-call in one-week cycles. We have an on-call day and an on-call night. Any infrastructure alarms go to the on-call person and he/she is expected to address the issue, which can include escalating it to the person who knows the most about it if the severity warrants that, or deciding it's not important and silencing it for some time period.

On-call fixes until the crisis is mitigated. If something can be addressed the next day during normal hours, it is.

On-call tickets are the most important. In our company, "on-call" just means you get the infrastructure alarms for that week and time period, and gotta deal with them when you get them. There is no alteration in pay for on-call weeks.

If the current on-call can't respond within 10 minutes (or forgets to ack in PagerDuty), the next person on the escalation schedule will be notified and expected to deal with the issue.

Other teams' risk is mitigated because if it's breaking production, we just call them and make them fix it. The whole company is "always on-call" in some sense, because if your thing is breaking production, you're going to get a call and get asked to fix the problem.

We feel like the tooling on this, even with PagerDuty whose whole job is managing on-call, is subpar, and we are constantly talking about creating an in-house replacement.
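The escalation behaviour described here (no acknowledgement within 10 minutes, page the next person on the schedule) can be modelled in a few lines. Below is a hedged Python sketch with made-up contacts and stubbed notify/ack functions; a real setup would rely on PagerDuty or a similar service rather than hand-rolling this.

    import time

    # Hypothetical escalation schedule, first responder first.
    ESCALATION_SCHEDULE = ["primary@example.com", "secondary@example.com", "manager@example.com"]
    ACK_TIMEOUT_SECONDS = 10 * 60  # escalate if not acknowledged within 10 minutes

    def notify(contact, alert):
        # Stub: in reality this would trigger an SMS/push/phone call via a paging service.
        print(f"paging {contact}: {alert}")

    def wait_for_ack(contact, timeout):
        """Stub: a real implementation polls the paging provider for an acknowledgement.
        Here we simply wait out the timeout and report no acknowledgement."""
        time.sleep(timeout)
        return False

    def escalate(alert):
        for contact in ESCALATION_SCHEDULE:
            notify(contact, alert)
            if wait_for_ack(contact, ACK_TIMEOUT_SECONDS):
                print(f"{contact} acknowledged: {alert}")
                return
        print(f"nobody acknowledged: {alert}; raising to the whole team")

    if __name__ == "__main__":
        escalate("disk full on db-01")  # made-up alert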
lucaspiller over 9 years ago
A few years ago I worked at a company that paid £30/day for being on-call and £30/hour (or part thereof) while investigating / fixing incidents. We were usually on call for a week at a time, with a rotation of around 6 developers.

We got an SMS or phone call (we had a non-technical ops team that worked early/late monitoring systems and tried to do preliminary fixes) and were expected to be available and investigate 24/7. If you didn't acknowledge the SMS alert within 30 minutes everyone else on the team got it, so it was in your best interest to do so, to avoid waking up everyone else :)

Some of the services we had were pretty critical, so the fix in that case was 'as long as it takes to get it working', and then putting in a permanent fix during business hours. For lower priority services, we'd just leave it broken and fix it during business hours. If you were up fixing something in the night you weren't expected to be in at 8am the next day.

In terms of escalation, if there was a major problem that couldn't be handled we'd usually get in touch with a few other people on the team to get their input. This hardly ever happened though; the only time I can remember is when the ventilation system failed in a data centre, which was !FUN!

After that I worked at a startup that didn't pay for any overtime or on call, which is one of the reasons why I left. Since then I've been contracting, so I haven't had to worry about on call, but I'd be happy to do it again for the right price.
fweespeech over 9 years ago
1) We don't keep a formal on-call, but 3 of us are tied to an automated alert system and whoever has a chance to take care of it, does. We are all full stack devs, so generally we can fix it at the time. If it's complicated, we can help it hobble along and fix it later.

2) We get 3-4 alerts a year that have to be handled before the next business day.

3) As such, there is no real work priority, triage, etc. You resolve it immediately. There is no other priority. [Any on-call event == lost money]

> * how do you manage for other teams' risk? (i.e. their API goes down, you can't satisfy your customers)

Asynchronous processing. I buffer until their API is back up. I do this for literally dozens of companies, from small manufacturers to Amazon.

There really isn't any other good way to handle it, and if you need to do it otherwise, that really is a fundamental architectural problem that should have been resolved at the design phase.
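The buffer-until-the-API-recovers pattern mentioned above is what turns a third-party outage into a non-event for on-call. Below is a minimal Python sketch with an in-memory queue and a hypothetical send function; a production version would persist the buffer (database, message broker) and add backoff.

    import time
    from collections import deque

    # In-memory buffer of outbound payloads. A production version would persist this
    # so a process restart does not lose work.
    outbox = deque()

    def send_to_partner_api(payload):
        """Hypothetical call to the third-party API; return False when it is down or erroring."""
        # Replace with a real HTTP call; treat timeouts and 5xx responses as "down".
        return True

    def enqueue(payload):
        outbox.append(payload)

    def flush_outbox():
        """Drain the buffer, stopping (and keeping the remainder) as soon as the API fails."""
        while outbox:
            if not send_to_partner_api(outbox[0]):
                return          # API still down: leave everything buffered and retry later
            outbox.popleft()    # discard only after a confirmed successful send

    if __name__ == "__main__":
        enqueue({"order_id": 42})      # made-up payload
        while True:                    # worker loop; run via cron/supervisor in practice
            flush_outbox()
            time.sleep(30)

Popping an item only after a confirmed send gives at-least-once delivery across outages, which is why nobody needs to be paged just because the partner API is briefly unavailable.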
michaelt over 9 years ago
Overview: On call one week in six; pay is £300/week for being available (extra for national holidays), and time-and-a-half for any time spent dealing with calls, rounded up to the next 15 minutes.

Expected duties: Respond to major problems outside of office hours, in response to a phone call. Be online fixing the problem within 15 minutes of being called. Typical workload is 2 hours per week. Be somewhere with a reliable internet connection, and where you can stay even if an issue takes a few hours to resolve (no mobile internet from campsites). Be sober.

Fixes: Because of the need for code review and second testing, emergency code releases never happen out of hours. Instead, all code deployments include a backout plan, which reverts to a known-good version of the code (not a bug-free version, there's no such thing, but a version that hasn't caused major problems when run for several weeks). Data in the database may need to be manually fixed.

If you're patching an issue by deleting some bad data, and in your judgement deleting the data might delete evidence needed to identify the root cause of the problem, try to identify the root cause if time allows.

Priority: On-call is only called for major problems - either the software has raised an alert saying they should be called, or there is a problem costing thousands of pounds a minute, such as "website down", "customers cannot place orders" or "customers will not receive orders".

There isn't a formal SLA, but as it costs thousands of pounds a minute, the expectation is to fix it as fast as you safely can, without mistakes that make the problem worse.

Escalation: To their team leader, or to some other senior software engineer on their team (our team leaders have all been senior software engineers on the team they lead).

APIs down: In a rather old-fashioned design move, all critical components are maintained in house. Contact the out-of-hours on-call for whatever team maintains the broken system. If it's a non-critical component.
ogsharkman over 9 years ago
Programmer at a hospital - I only have to do stuff if I get paged (rotation is 1 in every 4 weeks, for 7 days, 5PM-8AM). If an issue affects a clinical caregiver then it needs to be resolved asap, no matter what. Priority only concerns the issue you were paged for; if the on-call person can't resolve it they will usually page additional people or escalate to a manager who can find the right people. If another team's issues are causing us problems we just need to inform the customer about what's going on and then keep them updated.
dripton over 9 years ago
I work at a big famous company that strongly believes in devops.

I'm on call about one week in 6. (About 12 people in the rotation, with two on-call at a time.)

When on call, the ticket queue is supposed to have priority, but managers have an unfortunate tendency to say "that guy's on call so he's not doing anything, so let's have him do it." So what ends up happening is that high-severity tickets get priority, then whatever random stuff the manager wants done, then low-severity tickets. So low-severity tickets tend to remain open for way too long because the on-calls aren't making them a priority.

For high-severity tickets, we escalate if we can't fix things quickly. First bring in the other on-call, then bring in the manager, then bring in anyone else on the team who can help.

When a dependency breaks, we can ticket its owners to fix it. If it gives us a high-severity ticket, then they get a high-severity ticket too. Ultimately there's a tradeoff between using fewer dependencies in the name of controlling your own fate, versus using more dependencies to avoid reinventing the wheel. Each team has to find that balance for itself.

I don't enjoy being on call, but I think it's the best solution we have. Stuff breaks and someone has to fix it. If you don't pick someone, then everyone has to fix it. If the same people who build something also have to fix it, they have strong incentives to do things correctly.
elliottcarlson over 9 years ago
Previous role:

* On-call is a week-long (Tuesday to Tuesday, 10am) rotation

* Primary duty on a page is to triage and be a point person during the incident

* Proper procedures during day-to-day coding should make the on-call shift uneventful 99.9% of the time

* Managers are the escalation point, then Directors, then the VP of Eng

* Reverting to the previous state and calling it a night is an acceptable fix - the primary issue can be resolved the next day

* Blameless post-mortems
spotman over 9 years ago
Hi,

My company does 24/7 devops for some clientele. We are a team of 6, for reference. We have been doing this for many years, since before "devops" was the term for it. Some of the platforms we have also built for these clients, some we simply manage, and for some we have only built and manage the automation.

In terms of answering your question:

- Expected duties. In terms of what we are selling the client, there is a long, specific list of exactly what we can provide. In practice, however, the list is exhausting and intended half for CYA, and we will fix whatever issues arise. Some of that is seeing things that need work/adjustment/tuning/optimizing before problems arise, and then, as you can imagine, the whole being-on-call thing means that you must be an expert at resolving issues that no one saw coming. In terms of SLA stuff, we are supposed to be on-scene (digitally) in 15 minutes, but we are never that slow.

- How deep. Think of it in phases. When something needs fixing, you must quantify the urgency. If there is an urgent fix needed, this is ground zero. You do whatever you can; if that involves waking people up, or checking out code bases that are new to you and patching some developer's code in the middle of the night, it's all on the table. In ground-zero mode, you do whatever it takes. Once the bandaid/fix/solution is in place to "stop the bleeding", as I like to say, then there are follow-ups, which may include working with the client to have them implement the elegant permanent fix, or, if they are in over their head, sometimes that lands on us too. But it serves no one to have a bandaid that is going to just get ripped off every night. So we see the problem from A-Z, even if our offramp is at B, or M. If it's not urgent, we will just wait for tomorrow to discuss it with the client; it falls out of the scope of on-call.

- Priority. Well, on-call work is billed and agreed differently from ticket work, since we are primarily a consulting company, so these are different buckets. But we also have our own products that we own 100%, and there the priority is quite easy: if something is broken/down/about-to-break, it trumps everything else. Regular tickets are great, but they are meaningless if the platform you're developing them against can't stay functional. On-call work rarely has a "queue" or even really a ticket system.

- There is no "complete in time"; it's either done or you failed. I say this having failed too, and it sucks, but it is what it is. If something breaks and you can't fix it, you don't just go home. But... sometimes you do, and you walk away from the scene with your tail between your legs; no one plans for that.

- Managing other teams' risk: communication. Putting energy in ahead of time and bringing things up before they break is huge. Also, if you say "Hey, you should turn left, there is a cliff!", and the client is insistent on turning right, this can do two things. A - they know, and hopefully it's recorded in an email or in a meeting that you wanted to turn left. B - if you're absolutely certain they are going to run over a cliff, but you're still on the line / have to support the darn thing anyway, you can quietly put a bunch of pillows at the bottom of the ravine and prepare for the inevitable. When the car goes over the cliff, and everything almost grinds to a halt, and you manage to have been correct about it, but also managed to bail them out, you score a lot of gratitude from the client, and ongoing future relations.

This is all mostly from the consulting point of view, but I have done this same type of work for many large companies directly on their payroll too, and it all applies; the bigger the company, the more it's like consulting in some way anyway, because bigger companies are more compartmentalized. I have also done this in many small startups where you're all just one small team.

Holidays and vacations are important, but they will never truly be the same after years of this. We are pretty good at it now, and we really do try to keep everyone up to speed with where "the bodies are buried" in all of our clients' infrastructure. That is the hardest part. Everyone can look at a perfectly groomed wiki or a set of 100% refined chef/puppet cookbooks/modules, but the world isn't perfect. So the hard part is learning how to take the punches with elegance, and people need a break. It really does take at least 3-4 people to have a not-insane 24/7 schedule.

We generally plan about 1-3 weeks ahead, depending on the time of year, and put it into our scheduling system, which is also the system that does our paging. We rotate weekends and work it out amongst ourselves. Some people have things like date nights that we know about and try hard not to screw up.

Don't build your pager system yourself; you have enough stuff to do. I won't plug them because I don't want this to sound like an advert, but do yourself a favor and pay the money to a company that specializes in bubbling up alerts into notifications that can do phone/sms/email/ios/android/push/etc. These have really helped us manage the insanity & schedule.

If there is any advice: simply don't be frightened to say when you don't know something, even if you're tasked with fixing it. There is little room for ego when you're the one who ultimately has to be responsible. Being on call means you can't pass the buck, and the sooner you know what you don't know, the sooner you're able to learn it!

Edit: Typos
fractalcat over 9 years ago
Last time I was in a role which involved an on-call rotation:

> expected duties (only answer on-call, do other work, etc)

Only answer pages. My employer did shifts a bit differently from most companies - only six hours per shift, no fixed schedule (decided a week in advance) and only outside of work hours (pages during work hours were handled by whichever sysadmins were on duty), which worked quite well to avoid burning out sysadmins. On-call shifts were paid, and shortages of volunteers were rare.

I'd expect to spend maybe fifteen minutes per shift fixing things, on average (this is in managed hosting, so a page could be any of our customers' services).

> how deep does your on-call dive into code to fix, vs triaging, patching, and creating follow-up work to fix permanently?

In my case (sysadmin for a managed hosting company) the code involved was often not under our control; the standard practice was to escalate to the customer if the cause of the outage was a bug in the application. The usual process when suspecting a bug was to track it down if possible (the codebases were usually unfamiliar, so this wasn't always the case), work around it as best we could (e.g., temporarily disable a buggy search indexer which was leaking memory, et cetera), and then get in touch with the customer (by email if the workaround was expected to last until work hours, by phone if not). Occasionally I'd fix the bug in-place and send the customer the patch, but this was technically outside scope.

> priority order of work (on-call tickets vs regular work vs existing on-call tickets vs special queue of simple tasks)

The only priorities were resolving the pages at hand and arranging followup where needed (usually raising a ticket to be followed up during work hours).

> what happens when the current on-call can't complete in time?

Generally the on-call sysadmin would resolve whichever pages they had acknowledged; in the event of an extended outage the acking sysadmin was expected to brief and hand over to the person on the next shift.

> how do you manage for other teams' risk? (i.e. their API goes down, you can't satisfy your customers)

In practice, we could escalate to anyone in the company for a serious outage we were unable to handle ourselves. This was pretty rare, as a small ops-heavy company, but everyone had access to everyone else's cell phone number, and an outage-inducing bug was usually sufficient cause to wake someone up if it couldn't be worked around.
foxyv over 9 years ago
Developer, on-call every 8 weeks or so. 90% of the job is checking logs and saying "Yup, it's working" or "Nope, $VENDOR is down again." I get paged on average 3 times a cycle. One time I actually got to fix some code. That was awesome. It just feels really weird that no one in the tier 1 center can check logs at 6AM. But then again, there is a lot of turnover.

We don't get paid for on-call, but the job is super flexible with hours so we get flex time, which is well worth it. I like being able to take a nice 2-hour lunch every once in a while.
gambiting over 9 years ago
I work for a major games developer/publisher. Our submission packages are being built this week and will be submitted next Monday. We are supposed to be on-call this weekend just in case any problems arise in any of the systems we were responsible for. No, I am not paid for being on-call, just like I am not paid overtime. From what I can tell, people are going to come in this weekend just to play the game and make sure everything works, even though they are not getting paid for it.
laumars over 9 years ago
I'm basically expected to be on call 24/7, which gets a bit much some weeks.
jjp over 9 years ago
In a past life, covering out of hours (typically 9pm through 6am) for a core application at an insurance company, the set-up we had was a ladder of 4-6 staff providing on-call support. Whoever was top of the list was paged first, and if they didn't acknowledge within x minutes we kept going down the list until somebody responded. Once you got called you owned all incidents for that night, and the next morning you would be moved to the bottom of the list.

When we got called out we were well recompensed, at a minimum of two hours at 3x hourly rate. If you resolved the incident in 10 minutes you still got 6 hours of pay. If you got another call within those first two hours then you didn't get anything extra until you had been working for 2 hours and 1 minute, but if the 2nd call came in after 3 hours you started the clock again. During times of instability (new code release) we often had management agree to you working from first call-out until the day shift came in, in order to minimise downtime. When we were working during the night the on-call work always had priority, but if you were sat around waiting for the system to do something then we all did something else to keep us awake. There was no expectation that it was other tickets or project work; we made personal choices of doing day work, or it could be a crossword puzzle, etc.

We all had areas of expertise, and if you caught something you didn't know how to resolve then we had the authority to ring anybody, and that person would also receive the same compensation. In 5 years doing this I never had a complaint from calling somebody else out. Did we get tired? Yes. Most of us had project roles or team lead roles during the day, but we all knew that the production system had priority, and we and our own line management would deal with the project work accordingly. Did we get burned out on this? No, because the rota meant that you still had a life. If you were top of the list and had other commitments that night, you'd negotiate coverage with somebody else further down the list to see who could go top for the night.

My employer did this because the cost of not doing it, and having downtime during the day, was a lot higher. If the system failed out of hours it could impact our business and we could lose c.8,000 hours of day-shift working time that would still require salary payment. Plus reputation damage, etc. The most I ever accumulated over a weekend was 32 hours at 3x (a situation so bad we were required to restore the database to last known good and then catch up the log files).

One night I also got it wrong, and in trying to correct the incident forced us into a situation where we had to do a database restore. That caused about 4,000 man-hours of downtime. When my boss's boss came in at 6:30am (he got called out by my boss) I got sent home, being told I was too tired to continue support, that day shift staff would take over, and given the missive to ring him when I had slept. I made the phone call expecting a significant dressing down, only to be told: everybody else who looked at the diagnostics said they would have done exactly the same, we've all learned, and you stay top of the list tonight so that we know you're not scared to get back on the horse. Truth be told, it did knock my confidence, and for anything a little out of the ordinary after that I always ended up calling somebody else for a consult. After a while I engineered myself off the on-call list.
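The call-out clock described at the start of this comment (a two-hour minimum at 3x, calls inside an already-paid window add nothing until the work exceeds it, a later call restarts the clock) is easy to mis-tally. Below is a Python sketch of one interpretation of those rules, using made-up call times and an assumed hourly rate.

    # One interpretation of the call-out pay rules described above:
    #  - every call-out pays at least 2 hours at 3x the hourly rate;
    #  - a call that starts inside an already-paid window adds nothing until the work
    #    actually runs past that window;
    #  - a call after the paid window has ended starts a fresh 2-hour minimum.

    HOURLY_RATE = 40.0   # assumed rate, for illustration
    MULTIPLIER = 3.0
    MINIMUM_HOURS = 2.0

    def paid_hours(callouts):
        """callouts: list of (start_hour, end_hour) pairs, sorted by start time."""
        total = 0.0
        paid_until = float("-inf")
        for start, end in callouts:
            if start >= paid_until:
                # New incident: fresh two-hour minimum.
                total += max(MINIMUM_HOURS, end - start)
                paid_until = start + max(MINIMUM_HOURS, end - start)
            elif end > paid_until:
                # Overlaps the paid window: pay only the extra time.
                total += end - paid_until
                paid_until = end
        return total

    # A 10-minute fix at 01:00 still pays 2 hours; a second call at 04:00 restarts the clock.
    calls = [(1.0, 1.17), (4.0, 4.5)]
    print(paid_hours(calls) * HOURLY_RATE * MULTIPLIER)  # 4 paid hours * 40 * 3 = 480.0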
romanhn over 9 years ago
- Every team is free to choose the on-call practices (escalations, ramp-up for new members, etc) that work for them, although many practices are shared. Tooling is the same across the board. I'll speak for my team.

- Rotations are mixed, consisting of weekdays and weekends. In other words, if you're on-call M-F this week, in a few weeks you'll go on-call over the weekend.

- If the on-call engineer does not pick up, the incident is escalated to the manager, then their manager, all the way up to the CTO. Some teams have secondaries before managers are involved.

- The on-call engineer would normally work off the regular backlog. We have talked about pulling them out to work on on-call + tech debt exclusively, but there hasn't been enough on-call churn to justify this.

- The on-call is on the hook for resolving the immediate issue. In many cases this does not mean actually fixing the underlying problem. Those get written up as tickets to be prioritized as part of the backlog.

- The priority is determined as a team. As a manager, I encourage the team to aggressively tackle on-call issues; I don't want to be a blocker for that. If something consistently pages us, it is guaranteed to make it to the top of the backlog fast. We have also chosen to prioritize product work over on-call tickets when we felt our pace was good and the on-call churn not too terrible. Kanban makes priority changes pretty easy.

- Not sure what you mean by "on-call can't complete in time". On-call-related tickets end up on the backlog so anyone can tackle them, not just the person that filed them.

- We have a pretty good on-call culture and the teams are very sensitive to other teams' pain. If we're being paged about an issue with another team's code (shouldn't happen too often), there's always the option to page their on-call and triage together.

- We track operational churn, send a weekly on-call hand-off email to the team with notes about the pages and steps taken, and have operational review meetings where these emails are reviewed (+ any other operational matters) and next steps are determined. Maintaining transparency around operational pain and building structure around the follow-up process has been really helpful in reducing our weekly churn.

- I think the ideal number of team members that should be on-call is around 5 to 8. Anything less than that and rotations become very frequent and a burden on the individuals. Anything more and the rotations are so rare that every on-call feels like the first time.

- Last piece of advice is to FIX THE PROBLEMS (or change your alerts). Build whatever process you need to make sure that you have plenty of leeway to address anything that pages you constantly. Don't get overly attached to the alerts - you might find that changing sensitivity thresholds or deleting the alert altogether might actually be the right answer (please don't do that for the wrong reasons though :)). If you're paged for a bunch of things that you can't fix, you're probably doing it wrong. Just like technical debt, if you don't tend to on-call issues they WILL get out of hand. Picking away at it slow and steady will almost certainly help reduce it from a torrent to a trickle.

N.B. I work at PagerDuty. A bunch more suggestions here: https://www.pagerduty.com/blog/tag/best-practices/
eecks over 9 years ago
We have no developers on call
caw over 9 years ago
I've worked under two different on-call systems. The first was at an enterprise company with a very defined process: 1 hour of pay for every 10 hours on-call (even if you're salaried, they figure out your hourly pay for this). You must be able to respond and/or get into the office within 1 hour, otherwise they call your manager to get someone working on the issue. Our 24/7 support would triage all issues before escalating to the on-call engineer - they were able to resolve most issues. The rotation period was roughly 1 week in 3 or 4, on a volunteer basis. We did have a group of software engineers who had an on-call rotation for the internal applications that powered the business. If I needed additional support for the application I could tell our 24/7 support folks to page/call them and they would hop online, which was needed on the rare occasion when I couldn't fix an application error or a bug slipped past the manual and automated QAs.

My current on-call system, at a startup, is considerably different. Most weeks there will be no pager alerts; some weeks will be particularly bad because something fell out of stability due to some other changes. There is a primary and a secondary on-call. Each level has 20 minutes to respond and get online before it escalates (no office requirement since we're cloud based). The rotation goes through devops and all the software engineers, so you're on call 2 weeks in 8-10, with no extra pay. I wouldn't recommend this method of scheduling because it's mandatory for all, and some people don't take the duties seriously because nothing bad will happen if you let your pager slip to the next line, other than irking your coworker. There's no incentive to learn how to do the repairs, or to do them. I've seen a lot of "Oh my phone was off, I didn't realize I was on-call" that never happened at my previous job with volunteers for extra pay. Having the secondary is nice because it is a guaranteed person you can escalate to for help on a complex issue, and they are available if you need to be indisposed for a short period of time.

About your specific questions: your on-call duties are to communicate, fix what's broken, coordinate any additional escalations that need to happen, and most likely host the postmortem. Nothing more. Update your status page, send out an alert internally that X is broken, etc. When it's fixed and normal service has resumed, communicate that as well. You're the point of contact for anyone with questions about the issue, not anyone you had to bring online for a fix, because if they're interrupted they're not fixing.

Don't start non-on-call work. Fix to the extent that you know how and that will make it stable until business hours. Not everyone has coded every system and knows the "permanent fix" for every issue. Your priority is based on what is broken and the business criticality of it. If multiple systems have failed, you fix the most critical ones first, which are normally the customer-facing ones. There should be no "existing on-call tickets", because on-call bandaids the problem and raises a high-priority issue in the normal work queue.

If the on-call can't complete the work in a reasonable amount of time, it may make sense to raise other people who can help. If it's going to take you 4 hours to get $CRITICAL_FUNCTIONALITY back online, but getting Joe to help you will only make it take 1, then by all means try to get Joe if he's available. Again, based on your rotation he may not intend to be available, may not have his phone on, and may not want to take the call.

If you're dependent on 3rd parties, your application needs to take that into account. If you wake up because an external API is down and your functionality is down because of it, all you're going to be doing is losing sleep waiting for it to come back up so you can send out the all-clear. This changes somewhat if your external API comes with an SLA and a telephone number - by all means call and start their triage chain.

A good thing to have is a sync-up meeting with the on-call folks so you can establish patterns in the alerts that may not be evident to any one person.
INTPenis over 9 years ago
* expected duties (only answer on-call, do other work, etc)

Expected duty is to solve the problem, but there are often escalation paths to take. If a problem is not solved within the required SLA period then the company can be forced to pay the client penalty fees.

* how deep does your on-call dive into code to fix, vs triaging, patching, and creating follow-up work to fix permanently?

They go as deep as necessary, or as deep as is dictated by an operational manual pertaining to that particular client/environment.

* priority order of work (on-call tickets vs regular work vs existing on-call tickets vs special queue of simple tasks)

On-call is always higher priority, since it's an add-on service that clients pay for.

* what happens when the current on-call can't complete in time?

See above; penalty fees, mostly.

* how do you manage for other teams' risk? (i.e. their API goes down, you can't satisfy your customers)

Not sure I understand the question; if an API goes down and affects our services then that API needs to be monitored and handled by our on-call team.

* any other tidbits

I'm not on the on-call team, but I stay available for specialized expertise if the 1st line can't solve an issue.

I know how they work though, so here's one example. It all depends on the client's SLA, but let's say the client has 99% uptime and 24/7 on-call duties in their contract.

In that case one person out of 5 rotates an on-call device (phone) every 5 weeks. Under the strictest SLAs they're required to respond within 15 minutes and sometimes to have a solution within 4 hours. This varies wildly from contract to contract.

Of course an incident manager is available, redundantly, and is tasked with coordinating skills between departments to solve an issue within the designated SLA.

Alerts come into the device and the tech can respond to alerts directly via the device to acknowledge, force a re-check or simply disable them. There is also a more featured web interface for the alerts, accessible via a browser.

Alerts are sent by SMS through a self-hosted gateway, directly attached to the monitoring server, not using any e-mail translation API.

Alerts are logged, and in some cases a ticket is created for an alert.

Preferably a manager should work out the on-call schedule, but the techs often trade weeks and are more than capable of handling it themselves.

They receive an added monthly compensation including overtime, so any work must be reported in a time-reporting utility, eventually leading to paid overtime depending on the contract it pertained to and the time when it happened.