Putting out fires at 37signals: The on-call programmer

58 points by qrush about 13 years ago

17 comments

noahnoahnoah about 13 years ago

(I work at 37signals, though not as a sysadmin or developer)

Just to clarify - we do have a 24/7 on-call system administrator who is the first line of defense for when things go wrong. They're the ones who get phone calls when things do go 'bump' in the night, and they're fantastic in every way.

Our "on call" developers fix customer problems; rarely do these arise suddenly in the middle of the night, but our software has bugs (like most pieces of software) that impact customers immediately, and we've found it helpful to have a couple of developers at a time who focus on fixing those during business hours rather than working on a longer term project. Most companies probably don't call this "on call", but rather something like (as a commenter on the original post pointed out) "second level support". This is what Nick was describing in his post.

Of course, fixing root causes is the best way to solve bugs, and we do a lot of this too. We've taken a significant dent (>= 30% reduction) out of our "on call" developer load over the last 6-12 months by going after these root cause issues.

Hope that clarifies the situation some.

jtchang about 13 years ago

Is this seriously a post highlighting the heroics of being on-call?!

Wake up -- being on call sucks.

Being an on call programmer is even worse. All developers should have to work support sometime in their life to realize the pain of supporting software vs writing it. Only then will you realize why doing it "right" the first time really matters.

I kind of agree with the first comment on that post from Alice Young. Even though DHH just calls Alice out as trolling I know from experience that if you have on-call programmers it is a sign that your product is reaching a new level of complexity. Whether the complexity is coming from internal features or outside integrations it is probably time to take a second look at how you are handling your development processes.


shepbook about 13 years ago

"I spend one week every ten or so, on call. Then I spend the next nine weeks writing code to make my next on call shift better." - Tom LimoncelliSure, people may write off the fact that Tom found his niche in systems administration. He's currently at Google, as a "Site Reliability Engineer" which (in case you aren't familiar) is about 40% development work and 60% systems administration work. (Though his recent project, Ganeti, seems far more development work.)I find it "amusing" how so many people are all "DevOps! DevOps! DevOps!" _until_ it causes some kind of inconvenience for the developer. (Pesky paying clients! Why must you want what you paid for, to work!) Then it's "Make the sysadmin's do it. That's Ops job. It's not my job, as a developer, to help fix the service when it breaks. I write the code... it's your job to make it work, sysadmins..." Operability is _everyone's_ responsibility. If your code fails, for whatever reason, it should fail gracefully. It should tell us why it failed. This is the basis of operable code. Of course, even with testing or the best, possible, operable code, shit will still happen.I think the division of labor is simple. If the failure is clearly software related (you know this because you monitor your systems/software), the on call developer is paged. If the failure is hardware or core OS/system related, the sysadmin is paged. If shit's on fire, both are paged.Yes, we all know "Well Designed Systems and Software" shouldn't experience catastrophic failure. Guess what, it happens, no matter how well you prepare. So, you prepare for the worst case and have processes in place on how to deal with such issues. Drill your developers and sysadmins. Preparation is key.Ultimately, _everyone_ on your team should carry the title of "Chief Make It Fucking Work Officer". If you don't get this, don't sit here and gripe about "Not being DevOps-y enough" as is so prevalent in what I read and hear these days. 
When the Sysadmin says, "No, you aren't pushing code today.", don't bitch. Perhaps if developers accepted responsibility for helping support the systems and software they write, the Sysadmins would be more open to working with the developers.DevOps Motherfucker. Do You (do more than just) Speak It?
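
A minimal sketch of the paging rule in the comment above, assuming monitoring has already classified the failure. The `Failure` categories, the role names, and `who_to_page` are hypothetical and only illustrate the routing; nothing here is described by the commenter or by 37signals.

```python
from enum import Enum


class Failure(Enum):
    """Hypothetical failure classes, assumed to come from monitoring."""
    SOFTWARE = "software"          # clearly an application problem
    SYSTEM = "system"              # hardware or core OS/system problem
    CATASTROPHIC = "catastrophic"  # everything is on fire


# Hypothetical role names; a real setup would map these to pager targets.
ROUTING = {
    Failure.SOFTWARE: ["on-call developer"],
    Failure.SYSTEM: ["on-call sysadmin"],
    Failure.CATASTROPHIC: ["on-call developer", "on-call sysadmin"],
}


def who_to_page(failure: Failure) -> list[str]:
    """Return the roles to page for an already-classified failure."""
    return list(ROUTING[failure])


if __name__ == "__main__":
    for failure in Failure:
        print(failure.value, "->", ", ".join(who_to_page(failure)))
```

The hard part in practice is the classification step, which this sketch takes as given.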

vitovito about 13 years ago

I have to assume all of the other comments in this thread are from small shops that have never supported a live product.

We run a multi-hundred person team here for a live, 24/7 product, and as many as half of our developers have been scheduled as "on-call programmers," which we call our Live team. Their sole responsibility is the live, deployed product and customer-impacting issues.

They do no bug fixes outside of that. They do no feature development outside of that. There is an entire other team dedicated to those things, and like 37s, that team gets rotated through.

We also have QA dedicated to the live product, Operations dedicated to the live product, etc., etc., all separate from new feature development, because an immediate, customer-facing issue requires different prioritization than feature development.


nupark2 about 13 years ago

A requirement for 24/7 on-call programmers demonstrates a systemic organizational failure in the design and implementation of robust, well-architected software.

37Signals would see significant savings in development and maintenance costs -- and increased customer satisfaction -- if they approached this staffing requirement as a band-aid, not as a final solution, and took a long, considered look at the root cause of this systemic failure.


Smudge about 13 years ago

Don't be too quick to condemn 37signals for needing on-call programmers. For many startups, the process goes like this: all devs are always on-call. It seems that 37signals at least makes the requirements of the job clear. The fact is, running a live service almost always requires some degree of live support. (Even the most robust production software will experience the occasional hiccup.)

But it does seem like they're throwing money at the band-aids. Would love to see an article addressing how to fix the root of these sorts of problems, instead of just outlining how they put out all of their fires.


malbs about 13 years ago

I like how quite a number of people's answers to the on-call programmer blog were "you need better tests"

here's a what if scenario:

- you have a third party service your systems rely on
- at 4am on Sunday morning said 3rd party service upgrades their system, introducing a breaking change, having never bothered to notify users
- you get a call as the on-call person saying "application X is no longer working, please resolve"

How do tests stop that scenario from happening? Tests don't magically help you invent features/work around introduced issues in 3rd party systems.

Those are typically the on-call issues we deal with (we're on a weekly rotation)


johngalt about 13 years ago

Programmers shouldn't be on-call, but they should probably listen to the sysadmins who are.

I'll never understand why it's so common to use programmers as IT/Sysadmins. Operating a working system is fundamentally different than building it. No one would expect a ship designer to be a captain. Sure there is enough overlap to make it possible, but why not have them each handle their specialty?

If you've never experienced a good IT person backing you up I encourage you to try it. Detailed reports of failures/bottlenecks/repeatable issues. Problems already localized, and identified. No getting up at 2am!


biot about 13 years ago

I'm curious to know what compensation people receive for being on-call, either as a percentage of salary or flat rate.

(I'd submit a poll, but it appears from http://news.ycombinator.com/newpoll that polls are currently turned off.)


elliotanderson about 13 years ago

They're a geographically spread out company with employees spanning multiple timezones. They work in small teams and cycle their programmers into the support teams to get them on the front lines. The programmers in the support teams are "on-call" for issues that come up, skipping the need to send the issue over the fence and take someone off application development.

What's the controversy? Despite the name of the position, it sounds like it's just the role they assume in day to day work rather than fighting fires every couple of days.

ForrestN about 13 years ago

More than whether or not they "should" or shouldn't need on-call programmers, I am curious what causes the majority of errors that are encountered. Is it mistakes the programmers have made? Unpredictable interactions that are caused by the complexity of the software? Unexpected user behavior or interactions with client software? Something else a novice like me can't anticipate?

efsavage about 13 years ago

"We spend little time investigating crash bugs."Isn't "not crashing" kind of an implicit responsibility of any programmer? There are some bugs that aren't worth fixing, but even the most rare set of circumstances shouldn't be causing a crash for very long.


grover3333 about 13 years ago

Classic example of someone developing without considering support.

If I developed an app that required that much 'fire fighting', I'd replace it with something professional ASAP.

Or is it the selected technology that is the problem here?

anon808 about 13 years ago

There's a big difference between having programmers on-call and actually having work/fires for the on-call programmers to solve. We only know about one of these for sure from this post.

jtimberman about 13 years ago

If you write such awesome well tested code you won't mind being primary on-call to support it, since it won't break and you won't get paged.

paulhauggis about 13 years ago

Honestly, this sounds like a nightmare. It brings me back to my sysadmin days when I was getting paid $10/hour.

I would need to get paid lots of money to do this (dig into my precious free time). Probably more than 37signals is ever willing to pay me.

A buddy of mine is a sysadmin and told me that at his work, only the "best" techs get this duty. The company makes it sound like an honor to get pager duty and have to deal with putting out fires at 2am.


anthonyb about 13 years ago

I guess that's what happens when you don't have enough tests...</cheap shot>