Splitting engineering teams into defense and offense

212 points by dakshgupta 7 months ago

33 comments

solatic 7 months ago

This pattern has a smell. If you're shipping continuously then your on-call engineer is going to be fixing the issues the other engineers are shipping, instead of those engineers following up on their deployments and fixing issues caused by those changes. If you're not shipping continuously, then customer issues can't be fixed continuously anyway, and your list of bugs can be prioritized by management with the rest of the work to be done. The author quotes maker vs. manager schedules, but one of the conclusions of following that is that engineers don't talk directly to customers, because "talking to customers" is another kind of meeting, which is a "manager schedule" kind of thing rather than a "maker schedule" kind of thing.

There's simply no substitute for Kanban processes and for proactive communication from engineers. In a small team without dedicated customer support, a manager takes the customer call, decides whether it's legitimately a bug, creates a ticket to track it and prioritizes it in the Kanban queue. An engineer takes the ticket, fixes it, ships it, communicates that they shipped something to the rest of their team, is responsible for monitoring it in production afterwards, and only takes a new ticket from the queue when they're satisfied that the change is working. But the proactive communication is key: other engineers on the team are also shipping, and everyone needs to understand what production looks like. Management is responsible for balancing support and feature tasks by balancing the priority of tasks in the Kanban queue.

dakiol 7 months ago

I once worked for a company that required each engineer on the team to do what they called “firefighting” during working hours (so not exactly on-call). So for one week, I was triaging bug tickets and trying to resolve them. These bugs belonged to the area my team was part of, so they affected the same product but a vast number of microservices, most of which I didn’t know much about (besides how to use their APIs). It didn’t make much sense to me. So you have Joe punching code like there’s no tomorrow and introducing bugs because features must go live asap. And then I’m the one fixing stuff. So unproductive. I always advocated for a slower pace of feature delivery (so more testing and fewer bugs in production) but everyone was like “are you from the 80s or something? We gotta move fast man!”

glenjamin 7 months ago

Having a proportion of the team act as triage for issues / alerts / questions / requests is a generally good pattern that I think is pretty common - especially when aligned with an on-call rotation. I've done it a few times by having a single person in a team of 6 or 7 do it. If you're having to devote 50% of your 4-person team to this sort of work, that suggests your ratios are a bit off imo.

The thing I found most surprising about this article was this phrasing:

> We instruct half the team (2 engineers) at a given point to work on long-running tasks in 2-4 week blocks. This could be refactors, big features, etc. During this time, they don’t have to deal with any support tickets or bugs. Their only job is to focus on getting their big PR out.

This suggests that this pair of people only release 1 big PR for that whole cycle - if that's the case this is an extremely late integration and I think you'd benefit from adopting a much more continuous integration and deployment process.

stopachka 7 months ago

> While this is flattering, the truth is that our product is covered in warts, and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency.

> The result is that our product breaks more often than we’d like. The core functionality may remain largely intact but the periphery is often buggy, something we expect will improve only as our engineering headcount catches up to our product scope.

I really resonate with this problem. It was fun to read. We've tried different methods to balance customers and long-term projects too.

Some more ideas that can be useful:

* Make quality projects an explicit monthly goal.

For example, when we noticed the edges of our surface area got too buggy, we started a 'Make X great' goal for the month. This way you don't only have to react to users reporting bugs, but can be proactive.

* Reduce scope.

Sometimes it can help to reduce scope; for example, before adding a new 'nice to have' feature, focus on making the core experience really great. We also considered pausing larger enterprise contracts, mainly because they would take away from the core experience.

---

All this to say, I like your approach; I would also consider a few others (make quality projects a goal, and cut scope).

Attummm 7 months ago

When you get to that stage, software engineering has failed fundamentally.

This is akin to having a boat that isn't seaworthy, so the suggestion is to have a rowing team and a bucket team. One rows, and the other scoops the water out, while missing the actual issue at hand. Instead, focus on creating a better boat. In this case, that would mean investing in testing: unit tests, integration tests, and QA tests.

Have staff engineers guide the teams and make their KPI reducing incidents. Increase the quality and reduce the bugs, and there will be fewer outages and issues.

fryz 7 months ago

Neat article - I know the author mentioned this in the post, but I only see this working as long as a few assumptions hold:

* avg tenure / skill level of team is relatively uniform
* team is small with high-touch comms (eg: same/near timezone)
* most importantly - everyone feels accountable and has agency for work others do (eg: codebase is small, relatively simple, etc)

Where I would expect to see this fall apart is when these assumptions drift and holding accountability becomes harder. When folks start to specialize, something becomes complex, or work quality is sacrificed for short-term deliverables, the folks that feel the pain are the defense folks and they don't have agency to drive the improvements.

The incentives for folks on defense are completely different from those for folks on offense, which can make conversations about what to prioritize difficult in the long term.

eschneider 7 months ago

If the event-driven 'fixing problems' part of development gets separated from the long-term 'feature development', you're building a disaster for yourself. Nothing more soul-sucking than fixing other people's bugs while they happily go along and make more of them.

jedberg 7 months ago

> this is also a very specific and usually ephemeral situation - a small team running a disproportionately fast growing product in a hyper-competitive and fast-evolving space.

This is basically how we ran things for the reliability team at Netflix. One person was on call for a week at a time. They had to deal with tickets and issues. Everyone else was on backup and only called for a big issue.

The week after you were on call was spent following up on incidents and remediation. But the remaining weeks were for deep work, building new reliability tools.

The tools that allowed us to be resilient enough that being on call for one week straight didn't kill you. :)

cgearhart 7 months ago

This is often harder at large companies because you very rarely make career progress playing defense, so it becomes very tricky to do it fairly. It can work wonders if you have the right teammates, but it’s almost a prisoner’s dilemma game that falls apart as soon as one person opts out.

shalmanese 7 months ago

To the people pooh-poohing this, do y’all really work with such terrible coworkers that you can’t imagine an effective version of this?

You need trust in your team to make this work but you also need trust in your team to make any high velocity system work. Personally, I find the ideas here extremely compelling and optimizing for distraction minimization sounds like a really interesting framework to view engineering from.

jph 7 months ago

Small teams shouldn't split like this IMHO. It's better/smarter/faster IMHO to do "all hands on deck" to get things done.

For prioritization, use a triage queue because it aims the whole team at the most valuable work. This needs to be the mission-critical MVP & PMF work, rather than what the article describes as "event driven" customer requests i.e. interruptions.

d4nt 7 months ago

I think they’re on to something, but the solution needs more work. Sometimes it’s not just individual engineers who are playing defence, it’s whole departments or whole companies that are set up around “don’t change anything, you might break it”. Then the company creates special “labs” teams to innovate.

To borrow a football term, sometimes company structure seems like it’s playing the “long ball” game. Everyone sitting back in defence, then the occasional hail mary long pass up to the opposite end. I would love to see a more well developed understanding within companies that certain teams, and the processes that they have, are defensive, others are attacking, and others are “mid field”, i.e. they’re responsible for developing the foundations on which an attacking team can operate (e.g. longer term refactors, API design, filling in gaps in features that were built to a deadline). To win a game you need a good proportion of defence, mid field and attack, and a good interface between those three groups.

svilen_dobrev 7 months ago

IMO the split, although good (the pattern is "sacrifice one person" as per Coplien/Harrison's Organisational patterns book [0]), is too drastic. It should be not defense vs offense 100% with a wall in between, but for each and every issue (defense) and/or feature (offense), someone has to pick it and become the responsible (which may or may not mean completely doing it by hirself). Fixing a bug for an hour-or-two sometimes has been exactly the break i needed in order to continue digging some big feature when i feel stuck.

And the team should check the balances once in a while, and maybe rethink the strategy, to avoid overworking someone and underworking someone else, thus creating bottlenecks and vacuums.

At least this is the way i have worked and organised such teams - 2-5 ppl covering everything. Frankly, we never had many customers :/ but even one is enough to generate plenty of "noise" - which sometimes is just noise, but if good customer, will be mostly real defects and generally under-tended parts. Also, good customers accept a NO as answer. So, do say more NOs.. there is some psychological phenomenon in software engineering in saying yes and promising moonshots when one knows it cannot happen NOW, but looks good..

have fun!

[0] https://svilendobrev.com/rabota/orgpat/OrgPatterns-patlets.html

chiefalchemist 7 months ago

Interesting concept. Certainly worth trying, but in the name of offense (read: being proactive):

- "and our “lean” team is more a product of our inability to identify and hire great engineers, rather than an insistence on superhuman efficiency."

Can we all at some point have a serious discussion on hiring and training? It seems that many teams are understaffed or at least not satisfied with the quality and quantity of their team. Why is that? Why does it seem to be the norm?

- What about mitigating bugs in the first place? Shouldn't someone be assigned to that? Yeah, sure, bugs are a given. They are going to happen. But bugs in production are something real, paying customers shouldn't experience. At the very least, what about feature flags? That is, something new is introduced to a limited number of users. If there's a bug and it's significant enough, the flag is flipped and the new feature withdrawn. Then the bug can be sorted as someone is available.

Perhaps the profession just is what it is? Some teams are almost miraculously better than others? Maybe that's luck, individuals, product, and/or the stack? Maybe like plumbers and shit there are just things that engineering teams can't avoid? I'm not suggesting we surrender, but that we become more realistic about expectations.
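For anyone who hasn't used feature flags, here is a minimal Python sketch of the idea described in this comment: expose a new feature to a limited share of users, and flip the flag off if a significant bug shows up. The class name `FeatureFlag`, its methods, and the hash-based bucketing are all assumptions made for illustration; they are not taken from the article or from any particular flag library.

```python
import hashlib


class FeatureFlag:
    """Hypothetical percentage-rollout flag with a kill switch."""

    def __init__(self, name, rollout_percent, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100
        self.enabled = enabled

    def is_on_for(self, user_id):
        """Return True if this user should see the new feature."""
        if not self.enabled:
            return False
        # Hash flag name + user id so each user lands in a stable bucket
        # (0..99) and different flags get independent buckets.
        key = f"{self.name}:{user_id}".encode()
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self.rollout_percent

    def kill(self):
        """Flip the flag off: the feature is withdrawn for everyone."""
        self.enabled = False


flag = FeatureFlag("new_checkout", rollout_percent=10)
exposed = [uid for uid in range(1000) if flag.is_on_for(uid)]

# A significant bug is reported: withdraw the feature without a deploy.
flag.kill()
still_exposed = [uid for uid in range(1000) if flag.is_on_for(uid)]
```

In a real system the flag state would live in shared config or a database rather than in process memory, so that flipping it takes effect across all servers.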

philipwhiuk 7 months ago

We have a person who is 'Batman' to triage production issues. Generally they'll pick up smaller sprint tasks. It rotates every week. It's still stuff from the team so they aren't doing stuff unknown (or if they are, it's likely they'll work on it soon).

The aim is generally not to provide a perfect fix but an MVP fix and raise tickets in the queue for regular planning.

It rotates round every week or so.

My company's not very devops so it's not on-call, but it's 'point of contact'.
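A weekly rotation like the one described here can be computed rather than tracked by hand. The following is only a sketch under assumed names: `triage_owner`, the engineer list, and the start date are hypothetical and not from the comment.

```python
from datetime import date


def triage_owner(engineers, today, rotation_start, shift_days=7):
    """Return whose turn it is, cycling through the list every shift_days."""
    if today < rotation_start:
        raise ValueError("today is before the rotation started")
    # Number of whole shifts completed since the rotation began.
    shifts_elapsed = (today - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


engineers = ["ana", "bo", "cy"]      # hypothetical team
start = date(2024, 1, 1)             # hypothetical first day of the rotation

first_week = triage_owner(engineers, date(2024, 1, 3), start)
second_week = triage_owner(engineers, date(2024, 1, 8), start)
```

Swaps for holidays or sickness would need an override table on top of this; the modulo only covers the default order.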

ryukoposting 7 months ago

I can't be the only one who finds the graphics at the top of this article off-putting. I find it hard to take someone seriously when they plaster GenAI slop across the top of their blog.

That said, there's some credence to what the author is describing. Although I haven't personally worked under the exact system described, I have worked in environments where engineers take turns being the first point of contact for support. In my experience, it worked pretty well. People know your bandwidth is going to be a bit shorter when you're on support, and so your tasks get dialed back a bit during that period.

I think the author, and several people in the comments, make the mistake of assuming that an "engineer on support" necessarily can fix any given problem they are approached with. Larger firms could allocate a complete cross-functional team of support engineers, but this is very costly for small outfits. If you have mobile apps, in-house hardware products and/or integrations with third-party hardware, it's basically guaranteed that your support engineer(s) will eventually be given a problem that they don't have the expertise to solve.

In that situation, the support engineer still has the competencies to figure out who does know how to fix the problem. So, the support engineer often acts more as a dispatcher than a singular fixer of bugs. Their impact is still positive, but more subtle than "they fix the bugs." The support engineer's deep system knowledge allows them to suss out important details before the bug is dispatched to the appropriate dev(s), thereby minimizing downtime for the folks who will actually implement the fix.

jwrallie 7 months ago

I think interruptions damage productivity overall, not only that of engineers. Maybe some are unaware of it, and others simply don’t care. They don’t want to sacrifice their own productivity by waiting on someone busy, so they interrupt and after getting the information they want, they feel good. From their perspective, the productivity increased, not decreased.

Some engineers are more likely to avoid interrupting others because they can sympathize.

smugglerFlynn 7 months ago

Constantly working in what OP describes as defence might also be negatively affecting the perception of cause and effect of one's own actions:

> Specifically, we show that individuals following clock-time [where tasks are organized based on a clock**] rather than event-time [where tasks are organized based on their order of completion] discriminate less between causally related and causally unrelated events, which in turn increases their belief that the world is controlled by chance or fate. In contrast, individuals following event-time (vs. clock-time) appear to believe that things happen more as a result of their own actions. [0]

** - in my experience, clock-based organisation seems to be very characteristic of what OP describes as defensive, when you become driven by incoming priorities and meetings.

Broader article about impact of schedules at [1] is also highly relevant and worth the read.

[0] - https://psycnet.apa.org/record/2014-44347-001
[1] - https://hbr.org/2021/06/my-fixation-on-time-management-almost-broke-me

october8140 7 months ago

My first job had a huge QA team. It was my job to work quickly and it was their job to find the issues. This actually set me up really poorly because I got in the habit of not doing proper QA. There were at least 10 people doing it for me. When I left it took a while for me to learn what properly QAing my own work looked like.

ntarora 7 months ago

Our team ended up having the oncall engineer for the week also work primarily on bug squashing and anything that makes support easier. Over time the support and monitoring becomes better. Basically dedicated tech debt capacity, which has worked well for us.

marcinzm 7 months ago

It feels like having 50% of your team's time be spent on urgent support, triage and bugs is a lot. That seems like a much better thing to solve versus trying to work around the issue by splitting the team. Probably having those people fix bugs while a 4 week re-factor in a secluded branch is constantly in process doesn't help with efficiency or bug rate.

JohnMakin 7 months ago

This is a common "pattern" on well-run ops teams. The work of a typical ops team consists of a lot of new work, but tons of interruptions come in as new issues arise and must be dealt with. So we would typically assign 1 engineer (who was also typically on call) a lighter workload, and they would be responsible for triaging most issues that came in.

toolslive 7 months ago

The proposed strategy will work, as will plenty of others, because it's a small team. That is the fundamental reason. Small teams are more efficient. So if you're managing a team of 10+ individuals: split them in 2 teams and keep them out of each other's way/harm.

ozim 7 months ago

I like the approach as it is easy to explain and has catchy names.

But it sounds like there has to be a lot of micromanagement involved. When you have a team of 4 it is easy to keep up, but as soon as you go to 20, and that increase also means many more customer requests, it will fall apart.

ndndjdjdn 7 months ago

This is probably devops. A single team taking full responsibility and swapping oncall-type shifts. These guys know their dogfood.

You want the defensive team to work on automating away stuff that pays for itself in the 1-4 week timeframe. If they get any slack to do so!

stronglikedan 7 months ago

Everyone on every team should have something to "own" and feel proud of. You don't "own" anything if you're always on team defense. Following this advice is a sure fire way to have a high churn rate.

bsimpson 7 months ago

Ha - I think greptile was my first email address!

Reptile was my favorite Mortal Kombat character, and our ISP added a G before all the sub accounts. They put a P in front of my dad's.

eiathom 7 months ago

And, what else?

Putting a couple of buzzwords on a practice being performed for at least 15 years now doesn't make you clever. Quite the opposite in fact.

bradarner 7 months ago

Don't do this to yourself.

There are 2 fundamental aspects of software engineering:

Get it right

Keep it right

You have only 4 engineers on your team. That is a tiny team. The entire team SHOULD be playing "offense" and "defense" because you are all responsible for getting it right and keeping it right. Part of the challenge sounds like poor engineering practices and shipping junk into production. That is NOT fixed by splitting your small team's cognitive load. If you have warts in your product, then all 4 of you should be aware of it, bothered by it and working to fix it.

Or, if it isn't slowing growth and core metrics, just ignore it.

You've got to be comfortable with painful imperfections early in a product's life.

Product scope is a prioritization activity, not a team organization question. In fact, splitting up your efforts will negatively impact your product scope because you are dividing your time and creating more slack than by moving as a small unit in sync.

You've got to get comfortable telling users: "that thing that annoys you, isn't valuable right now for the broader user base. We've got 3 other things that will create WAY MORE value for you and everyone else. So we're going to work on that first."

Roelven 7 months ago

Getting so tired of the war metaphors in attempts to describe software development. We solve business problems using code, we don't make a living by role-playing military tactics. Chill out my dudes

madeofpalk 7 months ago

Somewhat random side note - I find it so fascinating that developers invented this myth that they’re the only people who have ‘concentration’ when this is so obviously wrong. Ask any ‘knowledge worker’ or hell, even a physical labourer, and I’m sure they’ll tell you about the productivity of being "in the zone" and lack of interruptions. Back in the early 2010s they called it ‘flow’.

Towaway69 7 months ago

What's wrong with collaboratively working together? Why is there a need to create an artificial competition between an "offence" and a "defence" team?

And why should team members be collaborative amongst their team? E.g. why should the "offence" team members suddenly help each other if it's not happening generally?

This sounds a lot like JDD - Jock Driven Development.

Perhaps the underlying problems of "don't touch it because we don't understand it" should be solved before engaging in fake competition to increase the stress levels.

namenotrequired 7 months ago

Many are complaining that this way the engineers are incentivised to carelessly create bugs because they have to ship fast and won’t be responsible for fixing them.

That’s easy to fix with an exception: you won’t have to worry about support for X time unless you’re the one who recently made the bug.

It turns out that once they’re responsible for their bugs, there won’t actually be that many bugs and so interruptions to a focused engineer will be rare.

That's how we do it in my startup. We have six engineers, most are even pretty junior. Only one will be responsible for support in any given sprint and often he’ll have time left over to work on other things e.g. updating dependencies.