Who's on call?

165 points by emilong, over 8 years ago

20 comments

raldi, over 8 years ago

Most interaction at Google between SRE and developer teams is mediated within the context of a "failure budget". For example, let's say the agreement between the engineers and the product and budget people is that the service needs to have four nines of reliability; that's the amount of computing and human power they're willing to pay for.

Well, that means the service is allowed to be down for about four minutes every month. Let's say for the past three months, the service has actually only been out of SLA for about 30 seconds per month. That means the devs have a bit of failure budget saved up that they can work with.

How do you spend a failure budget? Well, let's say you're a developer and you have a new feature that you just finished writing late Thursday night, but the SREs have a rule that no code can be deployed on a Friday. If you have a lot of failure budget saved up, you have more negotiating power to get the SREs to make a special exception.

But let's say that this Friday deployment leads to an outage late Saturday night, and the service is down for sixteen minutes before it can be rolled back. Well, you now have a negative failure budget, and you can expect the SREs to be much more strict in the coming months about extensive unit and cluster testing, load tests, canarying, quality documentation, etc., at least until your budget becomes positive.

The beauty of this system is that it aligns incentives properly; without it, the devs always want to write cool new code and ship it as fast as possible, and the SREs don't ever want anything changing. But with it, the devs have an incentive to avoid shipping bad code, and the SREs have reason to trust them.

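For readers who want the arithmetic behind this, here is a minimal sketch of the budget calculation (illustrative only; the numbers, not the code, come from the comment above, and none of this is Google's actual tooling):

    # error_budget.py -- illustrative sketch, not any real SRE tool.

    def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
        """Downtime allowed per period for an availability SLO such as 0.9999."""
        return days * 24 * 60 * (1 - slo)

    def remaining_budget_minutes(slo: float, downtime_spent: float, days: int = 30) -> float:
        """Positive means budget left to spend; negative means the SLA was blown."""
        return allowed_downtime_minutes(slo, days) - downtime_spent

    if __name__ == "__main__":
        slo = 0.9999                                # "four nines"
        print(allowed_downtime_minutes(slo))        # ~4.32 minutes per 30-day month
        print(remaining_budget_minutes(slo, 0.5))   # ~3.82 minutes left after 30s of downtime
        print(remaining_budget_minutes(slo, 16.0))  # negative after the 16-minute Saturday outage

Under these assumptions, a sixteen-minute outage wipes out nearly four months of budget, which is why the SREs in the example get much stricter afterwards.
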
mcheshier, over 8 years ago

Here's an idea: pay extra for on-call work. As a professional I want to fix stuff if I break it, but there's a limit to demands on my time.

It's especially infuriating to spend an evening away from my family to fix a problem that someone else caused and could have fixed in 5 minutes, but that I had to spend several hours getting familiar with.

At this point, if management asked me to start on an on-call rotation I'd want to know how I was going to be compensated for the additional time and opportunity cost of being on call, or I'd start looking around for a new gig.

oncallthrowaway, over 8 years ago

As an in-demand software engineer with oncall experience at a well-regarded company, I will not consider jobs that require me to be on call.

Jobs with oncall don't offer more compensation than jobs that don't.

Scheduling my life around being able to answer a page is inconvenient, and waking up in the middle of the night is something I'd rather avoid.

Operational work is often not considered as important as feature development for promotions, so you feel like you're wasting your time when doing it.

In my experience, system quality is completely independent of whether the developers do oncall or not. But I'd welcome objective data that proves otherwise.

There is no upside for me as an individual to take a job with oncall responsibilities.

gwbas1c, over 8 years ago

A few years ago I took a job where all engineers took turns carrying the pager. The reasons were that we were too small for dedicated ops resources (justified), and that the head of engineering wanted us to feel like a family restaurant (not justified).

Shortly after joining, I gravitated towards our desktop client and just couldn't keep up with all the changes in the server environment. When the pager went off, I just didn't know what to do. What was more frustrating is that our system had a few chicken littles in it, and I really wasn't up to date on the context about when "the sky is falling" really means "the sky is falling."

Probably the bigger problem is that I don't consider myself an "ops" person. I prided myself on making the desktop product stable and performant; I didn't have the time to learn the ins and outs of service packs and when to reboot.

I agree with the article completely: developers should be on call when their code is shipped, and while their code is immature. Just keeping developers on call, or rotating in developers who just aren't involved with the servers, is a complete waste of time. It fundamentally misunderstands why successful companies rely on specialization and division of labor in order to grow.

I think the author is spot-on when she states: "Who should be on-call? Whoever owns the application or service, whoever knows the most about the application or service, whoever can resolve the problem in the shortest amount of time."

torinmr, over 8 years ago

This was a pretty interesting article that hits very close to home (I'm an SRE at Google). I think the central thesis (that developers are better at running rapidly changing products because they are able to find and fix bugs more quickly) is a bit flawed, however.

The reason is that I think the most valuable contribution of the SRE is not in responding quickly to outages, but in improving the system to avoid outages in the first place. SREs tend to be better at this than developers because (a) they have better knowledge of best practices by virtue of doing this kind of work all day every day and (b) they are more incentivized to prioritize this kind of work.

Because of this, the dynamic I commonly observe is that SRE-run services have fewer and smaller release-related outages because techniques like canarying, gradual rollouts, automated release evaluation, and so forth are deployed to a great extent. On the other hand, developer-run services tend to have more frequent and larger release-related outages because these techniques are not used or are used ineffectively. So even though the developers can diagnose the cause of a release-related bug more efficiently than SREs can, the SRE-run service is still more reliable.

In my view, the main reasons to have developers support their own services fall into (a) there aren't enough SREs to support everything, (b) the service is small enough that investing the kind of manpower SRE would into implementing these best practices would not be cost effective, and (c) SRE support can be used as a carrot to get developers to improve their own services.

Edit: I would add that if the role of oncall is expected to include only carrying the pager, and not making substantial contributions to improve the reliability of the system, then the author is absolutely right that having an SRE or similar carry the pager has next to no benefit.

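As a rough illustration of the "canarying and gradual rollouts" mentioned here, a sketch under invented thresholds might look like the following (the metric names, ratios, and step sizes are assumptions for illustration, not Google's release tooling):

    # canary_rollout.py -- toy sketch of canary evaluation plus gradual rollout;
    # thresholds and step sizes are assumptions, not any real system's defaults.

    def canary_healthy(canary_error_rate: float,
                       baseline_error_rate: float,
                       max_ratio: float = 1.5,
                       noise_floor: float = 0.001) -> bool:
        """Continue only if the canary is not meaningfully worse than the baseline."""
        if canary_error_rate <= noise_floor:        # tiny absolute rates always pass
            return True
        return canary_error_rate <= baseline_error_rate * max_ratio

    def next_traffic_share(current_pct: int, healthy: bool) -> int:
        """Gradual rollout: double the canary's traffic share on success, drop to 0 on failure."""
        if not healthy:
            return 0                                # automated rollback
        return min(100, max(1, current_pct * 2))    # 1% -> 2% -> 4% -> ... -> 100%

The point is not the specific numbers but that the evaluation and rollback decisions are automated, so a bad release is caught while it is still serving a sliver of traffic.
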
_Codemonkeyism, over 8 years ago

"The number one cause of outages in production systems at almost every company is bad deployments" [referring to code deployments]

When I read post mortems from companies posted or linked here (e.g. Google, Facebook, ...) it does not seem that outages result from code deployments.

From my experience of 10 years as CTO/VPE, I've only seen some outages resulting from deployments (mostly because test data sets were too small and processing in production took much longer, resulting in slow responses and then an outage).

The majority of outages linked and experienced are either from growing load, from introducing new technologies (databases, deployments, but the outage was not from code and usually developers could not help), or from rolling out configuration changes.

What would be your main reason for outages?

iamthepieman, over 8 years ago

I've worked in an on-call rotation at one company and won't do it again. I was paid time and a half for all time spent dealing with issues while on call, as well as a small base amount that was something like 15% of base salary for the days you were on call, to account for the inconvenience of having to be near a computer, within cell service, and able to respond within 20 minutes at any time of the day or night.

I felt like this was fair compensation, but I still wouldn't do it again. Getting woken up at 2 A.M. and having to troubleshoot something for an hour and then not being able to fall back asleep, or having to interrupt a date, or just not planning dates when you're on call, is not worth it.

Now, my situation was multiple small systems deployed onsite at customer locations and subject to inconsistencies in their networks, weather-related outages, failed microwave towers, and computer-illiterate users. So being on call meant you were almost certain to actually get called. A company with a more centralized failure stack probably goes days or weeks between the on-call person being called.

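For concreteness, the compensation scheme described above works out roughly as follows (the 1.5x multiplier and 15% figure come from the comment; the salary numbers are invented):

    # oncall_pay.py -- toy arithmetic for the scheme described above;
    # the hourly rate is an invented figure for illustration.

    HOURLY_RATE = 50.0              # assumed base hourly rate
    DAILY_BASE = 8 * HOURLY_RATE    # one day of base pay

    def oncall_pay(days_on_call: int, incident_hours: float) -> float:
        """15% of base pay for each on-call day, plus time-and-a-half for incident work."""
        availability_pay = days_on_call * 0.15 * DAILY_BASE
        incident_pay = incident_hours * 1.5 * HOURLY_RATE
        return availability_pay + incident_pay

    # A one-week rotation with three hours of 2 A.M. firefighting:
    print(oncall_pay(days_on_call=7, incident_hours=3))   # 420.0 + 225.0 = 645.0
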
throw_away_981, over 8 years ago

The article talks about services at Google being relatively stable and SREs there focusing on automating the instability away. A previous comment here also mentions that. In my experience, I find that to be not really true and more of a marketing image. The stress that being a Google SRE poses on family and relationships is huge, and the job really is just like DevOps at other large service-based companies. The amount of code you write is orders of magnitude less compared to a software engineer, since there is significant tooling available (Google being a mature company). Most of the work done is just operating those tools.

I had an SRE girlfriend who became an ex because of the stresses it placed on our relationship. Although I was the one who helped her land the job and was with her through previous hardships, there were just too many missed dates, too little respect for my time, and too many other stress-related issues; breaking up was the only way out for me to regain peace of mind.

Maybe you need a certain sort of person to handle that kind of stress.

skywhopper, over 8 years ago

The perspective here is interesting and totally different from my experience as a sysadmin.

If bad code is the most common problem, then maybe it's time to tighten up the testing and deployment procedures first. The reason operations is a different job is that they take care of very different parts of the stack. Developers aren't going to be effective at their jobs if they also have to worry about tuning Java GC settings, analyzing database I/O bottlenecks, ensuring network security, worrying about network drivers, open file limits, and MTU size.

In my experience, the stuff that happens in the middle of the night more often involves infrastructural problems that ultimately have nothing to do with the code. And so it makes sense for the developers to sleep. By all means, assign an on-call developer that the operations staff can page when it's determined there's a code problem, but if that has to happen very often, then something else is wrong in your procedures.

carlisle_, over 8 years ago

I think the most frustrating part of this problem is how disengaged a lot of developers are from operational work. I don't think it's enough that we figure out who to delegate responsibility to. Both SRE/DevOps and developers should always be working together to avoid outages. There are usually things that make this hard, as described by Susan, but there has got to be a way to get people on the same page.

As an operational person, I want you to ship features, but I don't want you to break things. Developers want to focus on pushing features instead of getting bogged down fixing the work of yesterday. I don't think it's enough to try to make these things work with the teams as they exist. I think there needs to be this mentality from the get-go to make things good on both sides. Teams need to be engaging the other side throughout the entire engineering process, not just when they think they're ready for the hand-off.

scurvy, over 8 years ago

Close and shorten the pain loop as much as possible. If it causes availability pain, it should inflict pain on those who caused it.

If the developer wrote terrible code, the developer should be paged when the code/stack/framework breaks.

If ops/SRE/whatever chose a terrible server platform or cloud provider, they should be paged when the server crashes or goes offline.

Two decades of history have shown that the carrot doesn't work in this age of Internet companies. You have to use the stick. I wish the carrot worked, and there are those altruists who have only worked in ideal environments where it does, but they are the extreme outliers. The average lifespan of companies these days is too short for employees to stick around and actually care too much. All jobs these days are gigs, and most people are looking for the next one. Why would you waste time fixing your problems in this context?

Close the pain loop.

palakchokshi, over 8 years ago

The key point here is ownership.

Now there are multiple ways to define and transfer ownership. The primary reason for the split of dev and operations teams was so that dev teams are not held back maintaining systems when there's more dev work to be done. However, the split only works when deployment is a weekly or monthly activity. For continuous deployment, the dev team should be on call until the knowledge transfer can be done.

Where I work we have a split, and our process works (in theory):

1. Developers go through the build, test, deploy process.
2. Before deployment, the dev and operations teams meet and the dev team walks the ops team through the code, key changes, and key functionality implemented.
3. The ops team poses their questions, e.g. what assumptions were made, what the possible values for a particular config attribute are, etc.
4. Once ops is comfortable they understand the changes, the dev team turns the application over to the ops team.
5. This knowledge transfer happens in an hour-long meeting with key stakeholders from both teams present.
6. This process is for weekly or biweekly deployments.
7. For a brand new project/product, the dev team does a complete walkthrough with the ops team over a period of one week, and the dev team provides a six-week "warranty" period for the application, during which the dev team is on call.

agentgt, over 8 years ago

One of the challenges I have had, particularly with small teams (aka startups), is deciding what counts as a failure, and how to avoid fatigue if you are too aggressive about what a failure is.

I have found that if you are not aggressive about what a failure is (aggressive meaning classifying things that are not really fatal as outages... the system is up but there are lots of errors), it will bite you in the ass in the long run. The small errors become frequent big errors.

The problem is that if you are too aggressive, you will eventually get alerting fatigue.

I don't have a foolproof solution. I have done things like fingerprinting exceptions and counting them, to the extreme of failing really fast (i.e. crashing on any error).

In large part this is because small teams just don't have the resources to get this right but still have the demands to deliver more functionality.

I wish the article delved into this more, because there are different levels of "it's down".

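The "fingerprinting exceptions and counting them" idea can be sketched roughly as follows (the fingerprinting scheme and paging threshold are assumptions for illustration, not what the commenter actually built):

    # alert_fingerprint.py -- rough sketch of grouping errors by fingerprint and
    # paging only on repeats, to limit alert fatigue; thresholds are invented.

    import hashlib
    from collections import Counter

    def fingerprint(exc: BaseException) -> str:
        """Group errors by type and message shape rather than alerting on each one."""
        key = f"{type(exc).__name__}:{str(exc)[:80]}"
        return hashlib.sha1(key.encode()).hexdigest()[:12]

    class RepeatAlerter:
        """Page only when the same class of error keeps recurring."""
        def __init__(self, page_threshold: int = 10):
            self.counts = Counter()
            self.page_threshold = page_threshold

        def record(self, exc: BaseException) -> bool:
            """Returns True exactly once, when this error class crosses the paging threshold."""
            fp = fingerprint(exc)
            self.counts[fp] += 1
            return self.counts[fp] == self.page_threshold

The tension the comment describes lives in that one threshold: set it too low and you page on noise, set it too high and small errors quietly become big ones.
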
donretag, over 8 years ago

I turned down a job offer with Amazon, partly because of their on-call rotation. They do not have a devops team, or even a unified tech stack. Each team is responsible for the creation, deployment, and maintenance of their own code. I have dealt with too much shoddy legacy code in my lifetime; there is no way I will be woken up at 3am to support it.

Many years ago, there was no devops/SRE. You had the developers and a sysadmin team if your company was big enough. The sysadmins did not know anything about the application level, so developers were always on call. With the advent and rise of the devops role, developers can now focus on their main task.

I used Hadoop very early on (version 0.12 perhaps?), and I removed Hadoop as a skill from my resume since I did not want to admin a cluster, just do the cool MapReduce programming. Once again, devops to the rescue.

swagtricker, over 8 years ago

My team doesn't have too much of a problem with DevOps. Of course, there are two significant mitigating factors: 1) we have a large team of about 8 developers, so we each do on-call for one week every other month; 2) we're a moderately strong XP shop, so we pair program and TDD code, plus factor integration tests into stories (i.e. we avoid shitty code and "only so-and-so knows that" B.S. problems in the first place). I would NOT agree to DevOps on a 3-4 person team w/o some sort of significant stipend/bonus program, and I would _NEVER_ do DevOps on a team that didn't pair and didn't have good testing practices. YMMV.

udkl, over 8 years ago

I know from experience that Amazon engineers are responsible for the services they build. Amazon's motto is to push more operational tasks to the owners while providing them with great(?) monitoring and debugging tools to ease the load.

There was also a Netflix talk about its approach to operations, which was very similar to Amazon's. I feel the way Netflix organizes its general software processes is a mirror of Amazon's... maybe partly due to the AWS influence.

draw_down, over 8 years ago

Previously I worked at a place where it wasn't the case that deploys of new code often brought down the service (a webapp). If the service went down, it was usually because some random DB or Kafka topic or something else I don't understand at all took a shit in the middle of the night.

So we just kept deployments to weekdays before about 4pm, and if the site went down outside of that, well, it wasn't because of a deploy. And if it was, we were there to fix it.

jwatte, over 8 years ago

What the article calls "devops" I just call "ops." If the engineers writing the system also run the system, then they are "devops."

"Ops" proper is much older than "at least 20 years," because it traces a direct line back to sysadmins, who have been around since forever.

Our system works OK. We have devs, and ops, and devops. Devs run their new service with help from devops until it's stable and a runbook exists. Then it's handed off to ops to keep running, with support from devs if it breaks while still in active development, or from dedicated maintenance devops if it's mature.

Not perfect, but pretty good, and it efficiently runs the business as well as letting us iterate.

ChemicalWarfare, over 8 years ago

Most companies I've worked at had a layered on-call structure where Level 1 would be someone like a Customer Relationship Manager, Level 2 would be someone from the "sysadmin" side, and Level 3 an engineer from the dev side.

Once the issue got escalated to L3, it became a crapshoot, as even having someone from the dev side does not guarantee they know anything about the system, or the part of the system, that is having the issue.

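A layered structure like that is usually encoded as an escalation chain; a minimal sketch, with made-up tier names and timeouts, might look like this:

    # escalation.py -- illustrative three-tier escalation chain matching the
    # L1/L2/L3 setup described above; names and timeouts are invented.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Tier:
        name: str
        ack_timeout_minutes: int   # how long before the page escalates further

    ESCALATION_CHAIN = [
        Tier("L1: customer relationship manager", 15),
        Tier("L2: sysadmin on call", 15),
        Tier("L3: developer on call", 30),
    ]

    def next_tier(current_index: int) -> Optional[Tier]:
        """If the current tier doesn't acknowledge within its timeout, page the next one.
        None means the chain is exhausted and escalation becomes manual."""
        nxt = current_index + 1
        return ESCALATION_CHAIN[nxt] if nxt < len(ESCALATION_CHAIN) else None

The crapshoot the comment describes happens at the last hop: the chain can guarantee that *a* developer gets paged, but not that it is the developer who knows the failing part of the system.
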
kbredemeier, over 8 years ago

I'm currently a student at Holberton [1], and we did a project where we had to be on call ensuring optimal uptime for a server. It was really sweet to emulate a real work experience, but super stressful. I can't imagine if I had to be on call more than a night or two a month.

[1] https://www.holbertonschool.com