Hi,

My company does 24/7 devops for some clients; we are a team of 6, for reference. We have been doing this for many years, since before "devops" was the term for it. Some of the platforms we have built for these clients, some we simply manage, and for some we have only built and now manage the automation.

In terms of answering your question:

- Expected duties. In terms of what we are selling the client, there is a long, specific list of exactly what we can provide. In practice, however, the list is exhaustive, half intended for CYA, and we will fix whatever issues arise. Some of that is spotting things that need work/adjustment/tuning/optimizing before problems arise, and then, as you can imagine, the whole being-on-call thing means you must be an expert at resolving issues that no one saw coming. In terms of SLAs, we are supposed to be on-scene (digitally) within 15 minutes, but we are never that slow.

- How deep. Think of it in phases. When something needs fixing, you must quantify the urgency. If an urgent fix is needed, this is ground zero: you do whatever you can, whether that involves waking people up or checking out codebases that are new to you and patching some developer's code in the middle of the night; it's all on the table. In ground-zero mode, you do whatever it takes. Once the bandaid/fix/solution is in place to "stop the bleeding," as I like to say, there are follow-ups, which may include working with the client to have them implement the elegant permanent fix, or, if they are in over their heads, sometimes that lands on us too. It serves no one to have a bandaid that is just going to get ripped off every night. So we see the problem from A to Z, even if our offramp is at B, or M. If it's not urgent, we will just wait for tomorrow to discuss it with the client; it falls outside the scope of on-call.

- Priority. On-call work is billed and agreed differently from ticket work, since we are primarily a consulting company, so these are different buckets. But we also have our own products that we own 100%, and there the priority is easy: if something is broken/down/about-to-break, it trumps everything else. Regular tickets are great, but they are meaningless if the platform you're developing them against can't stay functional. On-call work rarely has a "queue" or even really a ticket system.

- There is no "complete in time"; it's either done or you failed. I say this having failed too, and it sucks, but it is what it is. If something breaks and you can't fix it, you don't just go home. But... sometimes you do, and you walk away from the scene with your tail between your legs. No one plans for that.

- Managing other teams' risk. Communication. Putting energy in ahead of time and bringing things up before they break is huge. Also, if you say, "Hey, you should turn left, there is a cliff!" and the client insists on turning right, this can do two things. A: they know, and hopefully it's recorded in an email or a meeting that you wanted to turn left. B: if you're absolutely certain they are going to drive over the cliff, but you're still on the line / have to support the darn thing anyway, you can quietly put a bunch of pillows at the bottom of the ravine and prepare for the inevitable.
When the car does go over the cliff and everything nearly grinds to a halt, and you turn out to have been right about it but also manage to bail them out, you earn a lot of gratitude from the client and goodwill for the ongoing relationship.

This is all mostly from the consulting point of view, but I have done this same type of work directly on the payroll of many large companies too, and it all applies; the bigger the company, the more it feels like consulting anyway, because bigger companies are more compartmentalized. I have also done this in many small startups where you're all just one small team.

Holidays and vacations are important, but they will never truly be the same after years of this. We are pretty good at it now, and we really do try to keep everyone up to speed on where "the bodies are buried" across all of our clients' infrastructure. That is the hardest part. Anyone can read a perfectly groomed wiki or a set of 100% refined Chef/Puppet cookbooks/modules, but the world isn't perfect. So the hard part is learning how to take the punches with elegance, and people need a break. It really does take at least 3-4 people to have a not-insane 24/7 schedule.

We generally plan about 1-3 weeks ahead, depending on the time of year, and put it into our scheduling system, which is also the system that does our paging. We rotate weekends and work it out amongst ourselves. Some people have things like date nights that we know about and try hard not to screw up. (There's a rough sketch of the kind of rotation logic I mean at the end of this comment.)

Don't build your paging system yourself; you have enough to do. I won't name names because I don't want this to sound like an advert, but do yourself a favor and pay a company that specializes in bubbling alerts up into notifications over phone/SMS/email/iOS/Android push/etc. These services have really helped us manage the insanity and the schedule.

If there is any one piece of advice: don't be frightened to say when you don't know something, even if you're the one tasked with fixing it. There is little room for ego when you're the one who is ultimately responsible. Being on call means you can't pass the buck, and the sooner you know what you don't know, the sooner you're able to learn it!
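For what it's worth, here is a minimal Python sketch of the weekend-rotating schedule I'm describing. The team names, start date, and weekend offset are placeholders I made up for illustration; a real scheduling/paging service also handles swaps, overrides, and escalation, which this toy example doesn't.

    from datetime import date, timedelta

    TEAM = ["alice", "bob", "carol", "dave"]   # placeholder names, not real people
    ROTATION_START = date(2024, 1, 1)          # arbitrary Monday to anchor the cycle

    def on_call_for(day: date) -> str:
        """Who carries the pager on a given day.

        Weekdays rotate by week; weekends rotate on a shifted cycle so the
        same person doesn't eat every weekend.
        """
        weeks_elapsed = (day - ROTATION_START).days // 7
        if day.weekday() >= 5:                            # Saturday or Sunday
            return TEAM[(weeks_elapsed + 2) % len(TEAM)]  # shifted weekend owner
        return TEAM[weeks_elapsed % len(TEAM)]

    # Plan a few weeks out and load it into whatever system does your paging:
    for offset in range(21):
        d = date.today() + timedelta(days=offset)
        print(d.isoformat(), on_call_for(d))

The point isn't the code itself; it's that the rotation is boring, predictable, and written down somewhere the pager can read, so nobody has to remember whose weekend it is at 3am.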