Being on call sucks

290 points by bobbiechen almost 3 years ago

61 comments

sam0x17 almost 3 years ago
Side note -- I know a lot of early YC startups like to play things fast and loose, but you really should compensate your engineers for after-hours emergencies if they are already working 40-hour weeks. Morally and employee-retention-wise it's the obvious thing to do, but beyond that, in certain states and jurisdictions you can easily run afoul of local labor laws if you try to require employees to do things outside of their regular hours without compensation. Another very important angle to consider is that you want to incentivize your employees to answer the call to deal with these issues. If there is no incentive structure in place, they might not take it seriously, which in itself can be a security and stability risk that could threaten things like your SOC-2 compliance, if you have it.

When I worked as a full-time DoD scientist, there was a very set-in-place system for dealing with these situations, and it was normal to pay double overtime for any hours spent by an employee on after-hours emergencies. This is the right way to do it. Pre-series B startups largely don't do this, but once you get to series B and C suddenly it becomes a thing because companies realize they either have to legally, or have to in order to prevent their employees from churning and to protect themselves from people not showing up to put out the fire.

Just do it, and do it early. Do it before Series A. That's the advice I give my consulting clients, and the approach I take with my own companies. By compensating your employees for this time you also take what would be a red flag for many would-be employees and turn it into an exciting perk.
cube2222 almost 3 years ago
On-call is just fine as long as:

1. There is a rotation.
2. There are few pages, preferably the median should be 0 per week.
3. Spurious / non-actionable alerts get fixed right away (with very high priority).
4. You're not up more than 1 week per 1-1.5 months.
5. You subtract middle-of-the-night pages from your next working day, with bad nights resulting in a day off. Being on-call doesn't mean working overtime.

As with most things, the core idea is not bad, it's the execution that matters.
Aeolun almost 3 years ago
> Does anyone have One Weird Trick™ to fix it?

My one weird trick is to have a zero tolerance policy for flaky monitors/tests. If it’s not accurate, we either have to drop everything else and fix it, or disable that alarm entirely.

Like they say, normalization of deviance is real, and the only way to fight against it is to have every form of deviance be a problem.
oogali almost 3 years ago
After years of iteration, here’s what our team does.

The team is remote and distributed across multiple time zones ranging from West Coast US to Western Europe. This gets us as close to round-the-world coverage as we can have.

There are two people on call for each shift, and each shift lasts a week. It will typically (but not always) be one person from the US and one person from UK/EU. This helps reduce the cost to any single person and spreads it out, so what might be night for one person is morning for the other and vice versa.

All of our alerts are prioritized/categorized to help prevent alert overload. For example, an alert for a test/QA environment will not fire outside of business hours, and it has a much longer time before it’s required to be ack’ed or resolved.

There are two on-call rotas: critical and non-critical. Critical, production-impacting, and/or client-facing alerts are dispatched to the critical rotation. The non-critical rotation only escalates alerts during business hours, again, with a more lax timeline for acknowledgment or resolution. People are not part of both rotas at the same time.

If there’s a big enough incident, the folks on call get to take off that next working day or the next one.

I (the manager) am on call 24/7 for escalation.

Anything that is an annoyance during on-call is a candidate for review and change. That can be anything from thresholds to code to upgrading some IaaS/SaaS subscription. Or even straight up disabling the alert if it provides no value.

People can swap on-call days as they want. Typically, this happens if there’s a birthday, personal event, or PTO, and it’s worked out among team members. If no one else is available, then I’ll take their shift and act as primary.
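The critical / non-critical split described in this comment could be sketched roughly as follows. This is only an illustration, not oogali's actual setup: the `Alert` fields, the 09:00-17:00 weekday window, and the destination names are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours; the comment does not state the real window.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    name: str
    environment: str      # hypothetical values: "production" or "test"
    client_facing: bool


def is_business_hours(now: datetime) -> bool:
    """True on weekdays between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Pick a destination for an alert (destination names are hypothetical)."""
    critical = alert.environment == "production" or alert.client_facing
    if critical:
        # Production-impacting or client-facing: page the critical rota any time.
        return "page-critical-rota"
    if is_business_hours(now):
        # Non-critical alerts only reach a human during business hours.
        return "notify-non-critical-rota"
    # Outside business hours, non-critical alerts wait.
    return "hold-until-business-hours"
```

For instance, a test-environment alert raised at 03:00 on a weekday would be held rather than paging anyone, while the same alert at 10:00 would go to the non-critical rota.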
sethammons almost 3 years ago
My last gig, on call worked well, I thought, for a few reasons: it was our services that we wrote, it was 1 week out of 6 that you were on call, we heavily prioritized fixing unactionable alerts and automating fixes -- every alert had a runbook entry that described the non-automated fixes, and while on call your sprint commitments were not counted.

That last point was very nice as it meant you could work on whatever you felt was most important for quality of life improvements all week long while not fielding on call issues. This meant that I looked forward to on call.
Tallianar almost 3 years ago
My company (UK) recently tried to force on-call on all engineers.

The initial wording was very restrictive, like 5 minute acknowledgement time and 15 minutes at-laptop. 24/7 for 7 days. They tried to have this implemented without any extra remuneration or perks for the on-call engineer.

On top of it possibly being very illegal, it seems very immoral to spring something like that on people that did not agree to it when they took the job.

I fought for it and I got them to change their policy in 2 mostly meaningful ways:

- It's an opt-in method
- On-call engineers get paid extra for just being on-call and get extra time off whenever they need to actually do something.

This makes sure that you only get people actually willing to do it and there is an incentive. I think it's been quite a successful program!

Luckily I didn't need to get them involved, but in the UK there are unions starting to form for tech workers. I suggest you join one like https://prospect.org.uk/tech-workers
tayo42 almost 3 years ago
Yeah, oncall is horrible. If things need to be up 24/7, then some team should be staffed 24/7 around the world.

The worst part of oncall is the control it has over your life. For one week I can't do anything I would normally do. (If your company actually compensates for this, let me know where I can apply, or better, if it doesn't have oncall at all!) Of course managers are never oncall 24/7. The worst is when they give the excuse "well, I'm on call all the time by default since I'm the one manager." But they're not reorganizing their life and putting their off-work hobbies on hold because of it, are they?

> a monitoring change that fixes some flaky alert that might page somebody about once every six weeks.

These kinds of things suck. I was on a team where we had tons of these; 10 alerts like this mean you're getting paged all the time. No single alert is worth the time investment. Worse, a manager insisted there will always be a baseline of alerts that go off and we will just live with it.

Teams never seem to understand how to alert on stuff. I've been paged for things going off that might indicate a problem, then you get stuck sticking around because someone else wants to just wait and see what happens. "We should just be cautious." It's impossible to push back on these things; you're just going against someone's gut feeling, like maybe one day we will want to know, and everyone needs to protect themselves.
kayodelycaon almost 3 years ago
On-call is even worse for people with disabilities. I quite literally can't do it unless I stop taking my antipsychotic.

Under ADA, I can not be placed on call, regardless of policy, nor can I be discriminated against for that. On-call is not an essential function of being a software developer, with very few exceptions, all of which have nothing to do with "policy" or "fairness".

Needless to say, companies (and some coworkers) really don't like this.
ahnberg almost 3 years ago
I'm one of the few people I've ever heard of who actually enjoyed being on-call. I believe it goes with my puzzle-solving mentality to an extent. Being randomly challenged with a problem to look at, where you might not know the solution, simply excites me.

Combining on-call duty with an approach of weeding out repeating issues, building better systems and ensuring that unnecessary calls don't happen is key, of course. Being woken up 25 times for silly predictable errors is pointless and draining.

And finally, having an employer that doesn't expect you to be in at 8am if you've been up all night is also very important; catching up on sleep is necessary to manage your balance and health. But given this freedom, I totally dig it. :)
oofnik almost 3 years ago
A major issue with on-call, and certainly one I've encountered multiple times, is the high likelihood of moral hazard - the people who are responsible for addressing incidents are not the same people who designed and maintained the system at fault. This results in the former team feeling powerless to put out fires which could have been prevented by more robust design, and the latter team having no incentive to improve reliability.

SRE gets this right, at least in theory, by requiring that all production systems be reviewed and approved, including observability and incident management procedures, prior to entering service. This ensures that there is some shared responsibility across teams for maintaining uptime.

https://sre.google/sre-book/being-on-call/
bennysomething almost 3 years ago
Being on call should just be paid at the normal hourly rate. It's an outrageous setup: we need you to be available but don't want to pay you for it.

I did it for a few years and it made me pretty depressed. By the end of it I stopped giving a shit and just resumed my normal Friday night of having a few beers (or a lot of beers). At first they offered me a rate that was less than minimum wage. I said no until it was an acceptable rate. Every other company seems to pay terrible on-call rates in the UK. It makes me angry just thinking about this period of my life.

Seriously, if a company wants to make money doing SaaS then they can't expect to steal employee time. My advice, if asked to do it: refuse until they offer very good money for it. Good luck to them recruiting a system expert to replace you.
angarg12 almost 3 years ago
This article should be called "being on a sucky team sucks".

It sounds like OP just has experience with one bad on call. I've also been there. I've even heard of teams with 100+ high severity issues per week. But it doesn't need to be that way.

My current team has the best on call I've ever experienced. It's a mix of a lucky product and some discipline. It isn't rocket science really.

If a team is drowning in ops, here are two easy techniques I've seen work well.

1. Whoever is on call, their job isn't only to answer pages, but also to improve the system. This solves the dichotomy of feature work vs ops fix. The ops fix is your job for that week.
2. Have team-wide (even org-wide!) fix-it days, where everyone works exclusively on operational issues.

Again, we didn't invent this. Look at Google's SRE books on reducing toil. You can adjust the ratios of feature vs ops work as needed.

And if you are in leadership, please acknowledge, celebrate and reward operational work. People tend to work on what they perceive as being valued.
ggeorgovassilis almost 3 years ago
I manage a team which operates services (among other things) for clients. Our aversion to being on-call drove us to build robust systems, automate the heck out of everything and monitor as much as possible. That allows us to spot issues during the day shift before they become problems for the night shift, so on-call duty became over time a relatively relaxed affair for the team.
donarb almost 3 years ago
Years ago, I was a second-shift operator in a computer center for an insurance company. We ran production jobs on an IBM mainframe. When jobs would crash we would write up the error on an ABEND form (IBM called crashes ABENDs for ABnormal END), collect the printout and call the programmer responsible. One night a production job crashed late in my shift, about 10:30-11 pm, and I woke up the programmer responsible. He seemed really groggy and it took a few minutes for me to describe the crash to him. I would always try to be helpful and suggest options for recovery (you got to know the programmers and what their recommendations would be based on the type of production job). Usually they would hold the job, restart it or say they were coming in to fix it. This was back in the day when, if they could log in remotely, it was with a clunky CRT terminal.

The programmer told me to just restart the job, and I noted that on the form. I came in the next day and my boss called me into his office; his boss was there too. They wanted to know why I restarted the job, which caused all kinds of corruption to the database. They had spent the better part of the day recovering the database, then running the batch job, which meant that the system was unavailable for use by the agents.

The programmer swore up and down he did not tell me to restart the job, said I never called him! He was that deep into sleep. But on the form I noted the time I called him and his response to restart the job, so they believed me.

This highlights a problem with people responsible for multi-million dollar systems being woken up in the middle of the night and having to make quick critical decisions.
humaniania almost 3 years ago
Getting woken up in the middle of the night by a blasting noise is traumatic. I still have PTSD from my on call time. I swear that to subject a prisoner to that type of condition, where they are woken up at arbitrary times and forced to solve complex problems, would be considered cruel and unusual and inhumane treatment.
indigochill almost 3 years ago
My team has what we call "the strike team", which is not just on-call: even during the day, your job is basically to make everything more robust (as opposed to what we do normally, which is work on new systems and features). So just last week or so there was an alert on Sunday that I then spent the week fixing permanently. These are also services that my team are the sole developers on, so I know when I fix something it will generally stay fixed.

On top of this, we have a rotation so each of us is only on-call one week out of maybe every four or five. And although I agreed to it mostly because it was a condition of the job and I wanted the job more for learning how the team worked than I cared about the money, the compensation for being on-call is actually pretty good even if nothing happens. And if on-call lands on a holiday, we get the holiday time as vacation days to spend later.

So overall, while I would prefer not to be on-call, I feel like our team implements it about as well as can reasonably be expected. I expected it to drive me crazy, but it actually hasn't yet.
wonderwonder almost 3 years ago
Favorite part of quitting my last nightmare of a job was that I continued to get text notifications that I was the on-call person for the night/weekend for ~6 months after I left. I slept well each night knowing I had the "something is broken" notification number blocked.
masukomi almost 3 years ago
Despite working on a financial product where a production fire could actually cost normal people money (worst case scenario), I don't mind being on call at all.

Why? Because I never get called. Because we wrote tests. Because we were effing careful. Because we have safeguards in place. Because IT has put hardware redundancies in place and because we have circuit breakers.

Being on call only sucks if you're being made responsible for crappy software that breaks regularly.

That's a problem if you aren't empowered to make it better. If your team isn't encouraged to care. Better... find a place that does care. Find a place where people hate the idea of being awoken at 4 AM and do everything in their power to make sure it can't happen to anyone.

The last time I got called it was because AWS went down, and we couldn't do crap about that.
sylens almost 3 years ago
I have an opportunity to move into an all-remote role that would require me to be on call 24x7 for one week every 3 or 3.5 months. I asked about the frequency of incidents that required the on-call person to be paged and it looked like on a bad week, it was about 7 total; a good week was 0. I’m personally torn on whether or not I’ll be okay with the on-call lifestyle, so I appreciated this piece for giving me some food for thought.
zeckalpha almost 3 years ago
> Last year as part of my move to product management, I was removed from the pager rotation for my old team.

Product should be on the on-call rotation, too, even if it is as a shadow. This is an important feedback mechanism about the choices they make.
g051051 almost 3 years ago
> Being On-Call Sucks

Which is why I won't do it. Either hire people specifically to do it, or provide sufficient incentives so that people volunteer. Being a good "first responder" is a skill and not everyone has it or wants it.
sebastianconcpt almost 3 years ago
I haven't had many, but of the few "events" that pulled me in to help, one managed to almost screw up a great romantic date on a Friday night, and the other had even more accuracy (details NSFW).

Being on-call is a deal breaker / non-starter for me.

I see myself helping the guys in operations to stay happy and well equipped to deal with keeping the service up, but I am not doing operations myself.

Ever.

A challenge for this is that when the devs keep being hammered with new features and cannot allocate time to improve the system's defensiveness against these "events", then the thing gets tricky.
dml2135 almost 3 years ago
I've been at my first software engineering gig for 4 years now, we don't have on call, and I've sworn to myself that I will absolutely never take a position with on-call.

I'm a bit worried that it will hamper my career prospects, especially as I've moved to doing more backend work. But I just can't imagine being tied to a work phone on my personal time -- I have a hard enough time enforcing work/life balance as it is.
pelasaco almost 3 years ago
Being on-call is working time. Therefore you should be paid for that.

In the team that I work on now, it's voluntary to join the on-call rotation. People get paid for that, both for being on-call and for any incident that they have to actively work. We also have a partially implemented "follow-the-sun" monitoring team, but that's by accident (our team has members from West Coast US, Europe and Australia, but weekends are covered by on-call).
kakwa_ almost 3 years ago
Compared to the article, many on-call situations could suck even more.

In a past assignment, I was on-call for a service managing thousands of customer instances, each with a different version of a highly customizable application (basically an SDK) developed by another team.

The result was on-call being basically a working day, getting paged on average every hour and at worst needing to track 2 or 3 war rooms in parallel.

Also, you felt powerless because while being accountable for the availability of the service, a lot of the issues actually came from customer implementations overloading their instance, or product bugs which would get ignored for years, and if a fix was done it would take even more years to be deployed due to the difficulties in updating the application.

The only saving graces were:

* It was in a large international org, so it was "follow the Sun", no midnight calls, only 8-hour shifts (11:00 to 19:00)
* I live in a country where it's a legal requirement to compensate employees for being on-call

Now, I've switched to a team where we handle on-call on services we control end to end (code and deployment), and it has been far less stressful.
ranger207 almost 3 years ago
I've always been on the ops side so I view being on call as an intrinsic part of my job, regardless of where I work. My current company does it well I think: each team has its own on-call rotation covering their own systems (with the ops team I'm on being a backstop), which means that you only get paged for your own bugs. In addition, whoever's on call that week is the designated "person to be disrupted" if someone needs something from your team. If someone needs ops help, they'll go to the ops on-call guy first. If I need to ask a question about one of our backend processes, I'll ping the backend on-call guy. We have dedicated on-call channels where all the alerts and pages go and where people post for help. Most importantly, leadership is just as invested in the product and happy to jump on as we are. The only 4AM page I've had yet, my boss and his boss were on the call before I was. Overall I'm happy with it.
jddddd almost 3 years ago
If you're on call 24 hours and you're not getting overtime pay, you're being taken advantage of.

Stop giving away free labor. On call is a scam for companies to get away with not staffing properly.
jpollock almost 3 years ago
I lead an oncall rotation. It's important to stay on top of the pages/alerts, or else the rotation will rot and enter a death spiral where you don't get any sleep.

Every page (particularly the nighttime ones) is root-caused by the team every week. Each page gets one of four things:

1. Fix the code to handle the situation.
2. Tune the alert to increase the signal - resulting in a more actionable page.
3. Re-route to a more appropriate team.
4. Remove the alert if it doesn't help us keep the systems running.

So many pages were "informational" ones that we couldn't action and that didn't indicate a problem that needed to be dealt with. Many others were bugs that people knew about but hadn't worked on because they didn't know it was waking us up! :)

Now, we get our sleep and people are asking to join the rotation!

Paying people to take the pager does not help when the rot sets in, but it does help encourage people to pick up extra shifts.
malwrar almost 3 years ago
A personal horror story: I have an on-call shift that is 24/7 for one week, and only 3 other people are on the shift. Alerts are frequent, noisy, and happen almost every night so you are practically guaranteed to never sleep fully that week, and the sheer breadth of services + teams we’ve accumulated and lack of any clear specificity in alerts means that I’m almost always at least somewhat confused as to if something is actually broken and, if so, how I actually fix it, even after 3 years (two of which when I lived alone during the pandemic). This was my first job out of college too; I was fully convinced that if I fucked up even once or called the wrong person I would be fired and all of the effort I put into getting this career would be meaningless.

I didn’t even get any prep or mentorship, they just suddenly put me on-call during a major product launch. No extra pay or time off btw, just gotta continue work if you were up all night trying to fix some thing with vague priority & vague symptoms (too much latency on a random offline service I’ve never heard of that turns out to be a dev experiment, latency being caused by a laggy database that you find out by finding some random message in splunk and regexing it out and into a graph).

I’m definitely a changed person after it, I don’t really… react as much anymore and flinch every time I hear a default iPhone text notification or ringtone. I don’t know how to fix it either; I don’t know if our team has enough people to spread the load out and I can’t think of any better way to keep track of failures in this labyrinth of services, and onboarding people to the point where they can actually take an additional shift usually takes 8-12 months. Even experienced people still get ambushed by new services with zero documentation.

Pros though: I don’t really experience much stress or uncertainty anymore in hard situations, and I seem to be much better at problem solving! I’ve also managed to keep my prestigious job with life-changing pay, which feels much more personally fulfilling than coasting at Google (even if it’s for the wrong reasons).
bogomipz almost 3 years ago
I'm consistently surprised by the number of startups that have distributed teams with people dispersed around the globe and then have an on-call rotation of one person being on-call 24/7 for a week straight. This to me is a management red flag.

One of the first questions I used to ask when interviewing for roles that had an on-call component was "what's the on-call expectancy?" I would recommend asking questions about the "on-call" experience to everyone on your interview loop as well. This is often very informative.

Expecting someone to have no life outside of work 24/7 for one week every 5 or 6 weeks is a real quality of life issue. And to do so without offering extra compensation just seems exploitive.
jamal-kumar almost 3 years ago
I used to do incident response against hacking attempts, DDoS attacks, threats like extortion... Like all of these on a regular basis. Something new and extremely stressful every month or so. Here's a fun fact: when your adversaries are on the other side of the planet they get to wake up early and start their day with the full knowledge that you may be just passing out at like 9pm in your locale. I used to answer my phone with "WHO THE FUCK ARE YOU AND WHY THE FUCK ARE YOU CALLING ME" at hours past that at first. I don't really know how I got used to it (though living somewhere where stuff is happening 24/7 anyways helped), but the compensation was decent.
valenterry almost 3 years ago
Why not just have on-call be an auction?

Some people hate on call, some people love it since they are usually at home playing games anyways or don't mind so much being disturbed during sleep and love to make some extra money on the side.

So just auction the on-call times and have employees bid for them. Naturally, bids for on-call at Christmas will probably be lower and for some other times higher. The employer can set a max. compensation they are willing to pay - if there isn't a low enough bid, well, on-call isn't happening. :)

In some companies something like that is already achieved by allowing people to switch times and also exchange compensation. A fully fledged auction is just the next step.
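One way to read this proposal is as a reverse auction: each employee names the compensation they would accept for a shift, and the lowest ask wins as long as it is within the employer's cap. The sketch below assumes that reading; the function name `award_shift`, the alphabetical tie-break, and the example amounts are all hypothetical and not part of the comment.

```python
from typing import Optional


def award_shift(
    bids: dict[str, float], max_compensation: float
) -> Optional[tuple[str, float]]:
    """Return (winner, ask) for the lowest ask within the cap, else None.

    `bids` maps an employee name to the compensation they ask for the shift.
    """
    eligible = {who: ask for who, ask in bids.items() if ask <= max_compensation}
    if not eligible:
        # No ask is low enough: under this reading, the shift goes uncovered.
        return None
    # Lowest ask wins; ties are broken alphabetically (an assumed rule).
    winner = min(eligible, key=lambda who: (eligible[who], who))
    return winner, eligible[winner]
```

With asks of 400 and 250 against a cap of 300, the 250 ask wins; if every ask is above the cap, nobody is assigned.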
manceraydar almost 3 years ago
It's possible to tolerate on-call if:

You have the ability to make changes necessary at the infrastructure engineering stage, not at the ops ad-hoc response stage, to prevent or end problems that cause call-outs.

There are those that view "DevOps" not as an engineering culture but as "people who fix problems in production" and even "do releases Friday at 10pm", and developers (or "engineers") as "people who make changes, and go to the bar Friday at 6pm." Companies that do this sometimes call themselves places that "move quickly, break often." Places like this, you hope never to work for.
di4na almost 3 years ago
The one weird trick is to pay people to fix it. There is an incentive problem here.

The second one weird trick is to legitimately ask people about the on call experience. Again. And again. And again.

The reason nothing changes is because no one is incentivised to change it.
dkarl almost 3 years ago
For larger companies: if you have active customers around the clock, have employees around the world, so engineers are never on-call outside their normal working hours. I guess that's much easier said than done, since having employees in different countries might be complex and expensive (maybe HR outsourcing companies like Trinet solve this for you?) and you have to manage employees in time zones that are offset from leadership.
robalfonso almost 3 years ago
Your org needs to be at either end of a spectrum. Either on-call is mostly quiet and non-disruptive, and truly only there for huge issues that happen seldom, or you staff up a dedicated 24/7 team. If it's in between, you need to plan on getting to one end before you wear out your team.

I think on-call and the quality of life component are highly dependent on the company culture, the types of alerts, etc.

My org's on-call was laid out like this:

3 days at a time and then a break of X days (depending on team size; this option was chosen by the team).

Comp time for any incidents (plus manager flexibility: if you were up late fixing something, no one expects you in early, or at all, depending on how it went).

We leveraged a provider to handle alert escalation, rotation, phone calls etc. If someone didn't answer, it rotated through to the next person and on up to management.

A regular look back at the type of calls coming in, and a re-balance of alerting priorities to make sure that if someone is going to get a call outside of office hours, it had better be necessary. We always asked "Could this have waited?"

A general culture of helping out: if you couldn't fix something you could ask anyone else near a machine to handle it.

A general culture of asking whether we could have automated a fix for this alert before getting a human involved.

Almost all tools were available via mobile, and you would be amazed how often you could fix something from a mobile phone. In fact I fixed some service issue in about 10s in a movie, never missed a beat.

Trading on-call windows was typical and easy.

If your org can't do the above and is truly wearing people out, then you need to go the other way, and just staff up 24/7 and let people have their lives.
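The escalation behaviour mentioned in this comment (no answer, so the page moves to the next person and eventually to management) can be sketched in a few lines. This is a simplified model rather than any provider's API: the `escalate` function and the `acknowledges` callback are hypothetical, and a real service would also handle timeouts and retries.

```python
from typing import Callable, Optional, Sequence


def escalate(
    chain: Sequence[str], acknowledges: Callable[[str], bool]
) -> Optional[str]:
    """Walk the chain in order and return the first person who acknowledges.

    `chain` is ordered from the primary on-call up to management.
    `acknowledges` stands in for "page this person and wait for an answer".
    """
    for person in chain:
        if acknowledges(person):
            return person
    # Nobody in the chain answered.
    return None
```

Called with a chain such as `["primary", "secondary", "manager"]`, the page stops at the first responder, so management only hears about it if everyone before them stays silent.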
jon-wood almost 3 years ago
There are many issues with being on-call, particularly in environments where false alerts routinely happen, and where management aren't in the rota so don't directly feel the pain. One of my biggest, though, is the concept of a weekly rota, with people being on-call for a full week at a time.

Sometimes that works fine, and you'll get no alerts all week, but incidents tend to cluster. If something has changed that caused an incident, odds are it's going to have knock-on effects, and you'll see more alarms over the course of a week. With a weekly rota you end up with one person handling that, who by the end of the week is completely destroyed.

Anywhere I've been responsible for setting up an on-call rota I've instead gone for daily rotations. That means if you were up in the night last night, someone else is going to be up in the night tonight. It also means if nothing happens you don't have to spend an entire week either cancelling plans or lugging a laptop around with you just in case.
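A daily rotation like the one described here amounts to a round-robin over the team, one day per person. The sketch below is only an illustration: the `daily_rota` function, the names, and the start date are made up for the example, and it ignores swaps, holidays, and weekends.

```python
from datetime import date, timedelta


def daily_rota(engineers: list[str], start: date, days: int) -> list[tuple[date, str]]:
    """Assign one engineer per day, cycling through the list in order."""
    if not engineers:
        raise ValueError("need at least one engineer")
    return [
        (start + timedelta(days=i), engineers[i % len(engineers)])
        for i in range(days)
    ]


# Example with hypothetical names: nobody is on call two nights in a row.
for day, who in daily_rota(["alex", "sam", "kim"], date(2022, 7, 18), 5):
    print(day.isoformat(), who)
```

With at least two people in the list, whoever was up last night is never the one holding the pager tonight, which is the property the comment is after.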
bluesnowmonkeyalmost 3 years ago
The most important criterion for me is that I only be on call for the things I built. Night and day difference in terms of tolerability.<p>If it wakes me, I should have built it better. So I work <i>really</i> hard to build things that don’t page. So I actually want to be on call for the quiet and reliable things I built, in order to experience the reward for all that effort.
prmoustachealmost 3 years ago
I used to be on call too and it sucked. At some point I even bought a personal pocket laptop specifically so I could still go on bicycle rides when on-call. I also went paddle surfing with my phone connected to a bluetooth speaker so I could go back home in case of an alert.<p>Now that I am working for a company that has team members in all timezones this on-call thing doesn&#x27;t make sense anymore. I understand some companies are not in a global market in terms of customers but I don&#x27;t see why their IT&#x2F;dev teams couldn&#x27;t employ people from all over the world. You are much more efficient responding to alerts during your own local office hours than when you were just woken up 30s before in the middle of the night and can barely open your eyes with the laptop backlight on.<p>Obviously this doesn&#x27;t apply to those taking care of datacenter duties.
nobleachalmost 3 years ago
I don&#x27;t mind being on call at my current place. I&#x27;m an app owner so I absolutely am the right person to call if things really get out of control. (So I don&#x27;t even mind being in the escalation list if I&#x27;m NOT on call).<p>What I cannot stand is when folks insist on not tagging errors that they have no intention of (or no timeline for) fixing. If you have no plan to fix something, stop waking me up at 3am for an alert. At a former job, I used to put this to my superiors constantly. &quot;Please tag this alert, get a ticket into our backlog so we can prioritize a fix&quot;. Alerts should be for exceptional situations. We&#x27;ve allowed services like New Relic to convince us that Apdex scores always translate to losing money.
stuff4benalmost 3 years ago
I used to be an SRE manager for an on-prem cloud at a previous job. What they don&#x27;t tell you is that even though your team of SREs that you manage has an on-call rotation, whenever there&#x27;s an incident, they call the on-call and then call their manager and get everyone on a bridge call until the incident is resolved. Which meant I was on-call 24x7x365. I barely lasted a year before I bailed. I am a single parent 50% of the time (shared custody of kids) and it&#x27;s hard to be on an incident bridge until 3am and then have to get up at 6am to get your kids ready for school.<p>I&#x27;ll never accept another job in my lifetime that requires on-call rotation (if I can help it).
fordalmost 3 years ago
I think more teams should recognize that <i>some</i> features&#x2F;products&#x2F;functionality (especially at large companies) are not worth paging for.<p>Obviously if a customer can&#x27;t access their bank account or sell a stock then someone needs to be woken up. But if an endpoint has high latency or a business can&#x27;t fire off a marketing campaign ASAP, that can usually wait. Entire teams of engineers can avoid the worst parts of oncall - the feeling of dread of waiting on the &quot;ba-dum&quot; in the middle of the night and questions about making plans outside of work - just by shifting oncall to noisily page during business hours only.
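One way to picture "page noisily during business hours only" is a small routing rule. The sketch below is an assumption-laden illustration: the 09:00 to 17:00 weekday window, the `critical` flag, and the names `route_alert` and `is_business_hours` are all made up for the example.

```python
from datetime import datetime, time

# Assumed business hours; adjust to the team's actual working day.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alert(critical: bool, now: datetime) -> str:
    """Decide whether an alert pages someone now or waits in a queue.

    Critical alerts (e.g. customers cannot reach their accounts) always page.
    Everything else pages only during business hours and is otherwise queued
    for the next working day.
    """
    if critical or is_business_hours(now):
        return "page"
    return "queue"


# A latency alert at 03:00 on a Wednesday waits; an outage does not.
three_am = datetime(2022, 7, 20, 3, 0)
print(route_alert(critical=False, now=three_am))
print(route_alert(critical=True, now=three_am))
```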
atemerevalmost 3 years ago
I enjoy being on call. I have heavy ADHD, so I am not a fan of routine, well-defined time activities and deadlines, but if I get this call at 2AM, I feel intense adrenaline rush, and I feel like a hero of the night saving the prod.
pm90almost 3 years ago
The trick to fixing OnCall is to have strong Government regulations around requiring employees to be available outside work hours. If that happens, employers will either drop or relax OnCall requirements, or build processes to manage it (e.g. larger teams, dedicated support teams etc).<p>It’s really that simple. Security professionals tried to get organizations to implement password requirements and MFA for several years, nothing happened. Then SOC 2 and other certifications came along, boom, 2FA everywhere.
cletusalmost 3 years ago
So I&#x27;ve been oncall at two major companies (Google and Facebook) and, at least in my experience, this covered both ends of the spectrum. Basically, Google gets it mostly right and Facebook gets it mostly wrong.<p>At Google, a new service has to be supported by the team that developed it. There&#x27;s an extensive launch checklist that includes monitoring, having a runbook, etc. Here&#x27;s the most important part: you&#x27;re paid when you&#x27;re oncall. The amount varies depending on how important the service is and the expected response time but can easily be 5 figures a year. Oncall period varies, but a week at a time is typical, with hopefully 8-12 people in rotation.<p>Too few people and people get burnt out. Even if nothing happens on an oncall shift, it&#x27;s an annoyance and a restriction on what you can do. Too many and people tend to forget what to do. So with a sufficiently large team you may end up with some people in the rotation and some people not. That&#x27;s why the compensation is important.<p>Particularly large, important and mature services may enjoy SRE support. You can&#x27;t throw a service over the fence and have SRE deal with it. It doesn&#x27;t work that way. It typically needs to have been running for at least 6 months and SRE needs to be satisfied it&#x27;s sufficiently reliable, stable and monitored with a good runbook. SRE support is globally distributed and <i>typically</i> means 8 hour shifts during normal hours.<p>The owning team will often still be secondary support.<p>Also code has to be owned by somebody. This may be a team but when I was there (some years ago now so it may have changed) this also meant 2 actual people (not just team aliases) had to be owners. This is to avoid abandonware. This very much is a support and oncall issue.<p>Facebook OTOH is a dumpster fire when it comes to oncall.<p>Not getting paid to be oncall is (IMHO) one of the biggest mistakes. 
The mantra is &quot;it&#x27;s part of the job&quot; but that responsibility is not shared equally. That&#x27;s the point of compensation.<p>My experience at Google was that issues were relatively infrequent. What I saw at FB however was that oncall could often be the only thing you did for the week. Noisy alerts, alerts caused by issues in downstream systems that you could do nothing about or would get ignored by their oncall, a bunch of issues raised that some would just ignore until they expired (or closed just prior to going out of SLA as &quot;could not reproduce&quot;), etc. You may also be dealing with code that nobody owns (or, rather, nobody takes responsibility for) for features that are live.<p>Plus the incentive structure, at least on the product side, was to ship new features. Oncall was often treated as just extra work you have to do on top of whatever else you&#x27;re doing.<p>Obviously I didn&#x27;t see how every team did it so none of this is absolute but I did see a reasonably high number of samples.<p>It&#x27;s also worth noting that not everything at FB is like this (eg the Web Foundation people were and I believe still are outstanding). Also, in high-visibility outage situations you have highly knowledgeable individuals who can and do get involved and know the right people to push.<p>The FB equivalent of SREs is Production Engineers (&quot;PEs&quot;). There are fewer of these and more services at FB are supported by the SWEs than at Google (IME).<p>I got the impression that FB processes and culture were forged when the company had fewer than 500 employees and they never really adjusted to the greater scale. There are a lot of things that work very well. Oncall just isn&#x27;t one of them. Nor is code ownership.
nickd2001almost 3 years ago
One way to make on-call less bad, ask your employer to let you work normal 8hr days on the weekend that you&#x27;re on call, in return for having 2 weekdays off the next week. That way you don&#x27;t have any weekend days that are lost due to having to be available. I was allowed to do that, it was a win-win. Useful for them to have really thorough extra cover over weekend, then 2 uninterrupted days in the hills for me. ;)
artisanal-oafalmost 3 years ago
Being on call is the bane of my existence. It’s also entangled with the issue that companies would never consider people working a night shift even tho tons of devs prefer to code at night (and they could actually get things done without constant context switching). How is it that we are expected to be on slack, checking emails, and on call after hours but employers rarely pay for cell phones or after hours labor?
zippergzalmost 3 years ago
One of the main reasons I quit being a professional software developer was the expectation of being on call. My sleep and my free time with my family are more important than any job. No matter how much I like the normal work, and how much they pay me, I&#x27;m done working jobs that insert themselves into my home life and ESPECIALLY that wake me up in the middle of the night. EVER.
coldcodealmost 3 years ago
I worked at a place in the 2000&#x27;s where our main application leaked like a sieve due to not releasing memory from a C++ framework, and two people had to take turns every other night restarting the app (actually it was chopped up into 20 different apps that had to be started in order by hand) every two hours or so. I can&#x27;t imagine they got much sleep on their night.
sllabresalmost 3 years ago
&gt; It took a few months to shake that Pavlovian association<p>It took much, much longer to shake off &quot;J.S.Bach Badenerie BWV 1067&quot; :-D <a href="https:&#x2F;&#x2F;youtu.be&#x2F;JvxeiTq9bqw?t=27" rel="nofollow">https:&#x2F;&#x2F;youtu.be&#x2F;JvxeiTq9bqw?t=27</a><p>I can say that for years I would wake up from the deepest sleep in &#x27;seconds&#x27; when hearing this melody.
ComputerCatalmost 3 years ago
The only job I ever had where I was on-call was at a discount shoe store... it was ridiculous and unnecessarily stressful.
goodpointalmost 3 years ago
The problem is not being on-call, but that most companies do it poorly.<p>Often it becomes a tool to force people to work a whole lot more.
varispeedalmost 3 years ago
I only ever agreed to be on call if I was compensated full rate plus an extra for being outside of core hours.<p>If you are on call, you are working and your job is to pick up the call as soon as you hear it and then act on it for the given period of time.<p>I don&#x27;t understand why people agree to do this basically for free.
victor106almost 3 years ago
Side note:- Increment Magazine from Stripe is one of the best tech magazines in recent times. Highly recommended
volumealmost 3 years ago
I wish all PMs would go on call for a week at least. The OP&#x27;s stint with on-call will be quite useful in his career since he can better intuitively view infrastructure more holistically. Then if the group is doing sprints, the &quot;firefighting&quot; will more easily get prioritized.
SomeCallMeTimalmost 3 years ago
I worked for one company that enforced on-call for the entire team. For a reason that&#x27;s out of scope for this comment, it didn&#x27;t apply to me, though it <i>should</i> have by the normal rules.<p>I loved working for the company, but at the same time I would have totally despised the on-call system as implemented.<p>The main problem was that I was working on client apps. My focus was an Android app, to be specific. There was a completely separate web team that developed the backend; we not only <i>didn&#x27;t</i> work on the backend, we were <i>prohibited</i> from working on the backend. I never even saw the backend code. We even asked to develop part of the backend we needed at one point, and they refused to let us, telling us that we didn&#x27;t understand their security requirements and therefore couldn&#x27;t contribute.<p>And from what I heard from people who were in the on-call rotation, every single 2AM call was from some badly designed alarm. Designed by their team.<p>Yes, every one of those alarms was fixed. Some may have been only &quot;fixed&quot; the first time, but I got the idea that each did eventually get adjusted to not be completely spurious.<p>But what really offended my sensibilities was that the backend team was pushing for the client team to be on-call to cover for what seemed to me to be profoundly poor alarm definition. They were eager to get others on board to cover the on-call rotation because <i>being</i> on-call was such a nightmare.<p>Another comment [1] points out that the median number of alerts in a week should have been zero. It wasn&#x27;t close, from what I could tell. 
And the whole &quot;make them eat their own dog food&quot; approach of putting the people on-call who are actually responsible for writing the code was broken by the practice of including people only loosely associated with the code (as in, we <i>used</i> it) in suffering the consequences of writing the code (or designing the alarms) badly.<p>In general, if I&#x27;ve broken something that&#x27;s affected a site or product, I&#x27;m happy to fix it, even if it&#x27;s after hours, though I prefer it to be on a &quot;best efforts&quot; basis rather than a &quot;drop everything and work on it now&quot; basis. What I don&#x27;t want to sign up for is being roused out of bed at 2AM to fix problems caused by someone who isn&#x27;t even on my team, where I wouldn&#x27;t have even seen the PR that caused the problem or any of the related code, and there&#x27;s absolutely no way I could have prevented it.<p>&#x2F;rant<p>[1] <a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=32163155" rel="nofollow">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=32163155</a>
silvestrovalmost 3 years ago
&gt; as part of my move to product management, I was removed from the pager rotation<p>So management does NOT bear the burden of rushed and unstable code.<p>Then management will only care about new features and not about stability. This will result in developers burning out over time.
parasensealmost 3 years ago
Story time:<p>This one time I was hired as a tier-3 Unix &amp; Linux support engineer at a managed-hosting provider who shall not be named. This was back in my prime, I&#x27;d already had 10 ~ 15 years industry experience, so I was hired to deal with the really nasty problems that percolated up the chain, and to solve those problems via an ongoing process-improvement loop. Over time my job got easier and easier, because lower tier engineers gained training and better standards, procedures, etc... I was on a two person team, we traded the on-call duty every other week. One day the other person left the company, and that began a comedy of errors...<p>We tried to fill the position both internally and externally, but the position was mostly vacant. Meanwhile I was covering the on-call 24&#x2F;7 on a tentative basis, until we hired a permanent replacement. Welp, long story short... I fell victim to my own success. Management observed that there was no apparent disruption to the tier-3 area of operation, and that was mostly true, so they decided to eliminate the redundancy. BIG MISTAKE!<p>I politely told management that I&#x27;d no longer be covering on-call 24&#x2F;7, and would go back to every other week; that I was burnt-out waiting for a replacement, and not being able to plan vacations, or even experience off-hours serenity for fear of the on-call phone ringing. That didn&#x27;t go very well, I was told that I&#x27;d be subject to disciplinary action, and they would make accommodation for vacations, but that I had to remain on-call 24&#x2F;7.<p>So I decided to become a party animal, every other week. As soon as I left work for the day, I&#x27;d start getting drunk. I had parties every night, or went to parties, bars, or whatever... on the weekend I&#x27;d go camping at state or national parks with 1-bar of signal for on-call device, I&#x27;d be on a boat in the middle of a lake an hour away from my laptop back at the cabin. 
Stuff like that, pretty much I&#x27;d make myself available 24&#x2F;7, but there was zero assurance I&#x27;d be sober or whatever... every other week.<p>This one time I went into work, and somebody was joking to me about something in context, like as if I knew what they were talking about, but I had no idea. Apparently I was engaged the night before while blackout drunk, and was semi-belligerent with the tier-2 engineer, but was able to get whatever emergency escalated issue resolved in a most anti-climactic manner. The story I heard was something about slurring-out linux commands from a noisy bar, and threatening to drive home (drunk) to ssh into a customer&#x27;s system. But whatever instruction I gave over mobile did the trick, and the company barely met the contractual SLA for that incident!<p>So my plan kinda backfired. Management figured out what was going on, and while being applauded for doing my job, I was formally written-up for being drunk on-call. So I responded by stating that&#x27;s only every other week, and my on-call rotation is documented on the company calendar. I asked if I was never permitted to drink off-hours while employed at the company, and that got HR involved. Apparently it was not entirely legit to be on-call 24&#x2F;7, nor to intrude into my personal life to such an extent. So I was asked to sign something, an amended employment contract. I refused, and suddenly my performance reviews tanked the next few quarters, and I was eventually let go via a round of &quot;layoffs&quot;. The entire tier-3 was eliminated, and they went with a tier-0 ~ tier-2 hierarchy (whatever that entails). No harsh feelings, I&#x27;m still in touch with many peeps.<p>It was at this point I transitioned from supporting Linux to making Linux. Pursued my passion, and started working full time as open source developer, and lived happily ever after not being on-call ever again. Or so I thought. 
Turns out software development occasionally has grindy rushes to meet deadlines, the so-called &quot;crunch&quot; culture, and instead of having a company provided on-call device... work peeps were calling my private number, in the middle of the night.<p>And so it goes...
SergeAxalmost 3 years ago
Working in general sucks most of the time. Being on-call should be just compensated, and the schedule should be adequate.
z3t4almost 3 years ago
There are always night owls who prefer to work at night (with double pay)
ricardobayesalmost 3 years ago
After being a SWE for a decade in Europe, I have never heard of anyone in my network who needed to be on-call. Is this a US thing or only for devops? Why would a software engineer need to be on call ever? That just means the CICD&#x2F;testing&#x2F;validation pipeline sucks.