I am a new manager of a large team (12 SREs) that takes care of the Kubernetes platform at my company. This team is responsible for the provisioning pipelines (for both baremetal and AWS - no EKS is used), the Kubernetes controllers that integrate with other custom services, the observability stack, etc. The total fleet in use is around 6000 baremetal nodes and 1000 VMs in AWS spread over various DCs and regions. There are over 1500 developers actively using the Kubernetes clusters every day for a total of 2500 applications running in production.<p>The team spends a lot of time on operations as well as on compliance issues, vulnerability patching and customer support. The struggle I'm having is "how to drive focus" and avoid to die of operations. The team is large making the Scrum process ineffective.
Every time I try to define teams and split people up, I realise that everything on the platform is so interconnected that the moment I created 2 or 3 separate teams they would start stepping on each other's toes.<p>What would you recommend?
I wouldn't split the team, because personally I think it'll weaken the concept of the "platform team". Ideally what you want is people with different priorities/focuses within the team: some on low-level ops, others on high-level APIs or services. Identifying whose skills fit what and letting them do that is key.<p>The other thing is to split the work into 3 categories: p1 bug fixes, long-term projects, and support work. Each week, make a note of the time allocated to each based on what's happening (sometimes a p1 fix can take up the whole week). Try to minimise the support burden by creating office hours and defining SLAs for the rest of the company.<p>Make sure your team is not getting buried in support work. What's going to help them is being able to filter out what's an immediate priority versus what can be pushed off to tomorrow or the day after. Don't let them get bogged down or pinged constantly. Try to make that request flow async.<p>And most importantly, give them the time to accomplish the tasks they think are most important. They are deep in the trenches and know what's going to be p1 vs not. Trust in their ability to guide the outcomes.
This sounds challenging! First, I’d recommend Kanban as opposed to Scrum for the team, as the team is so large that it is probably hard for them to work effectively on a single Scrum board.<p>Second, I’d make sure the team has space to do their operational duties, possibly on a rotation so that other members can focus more on platform development. It can be tempting to try to minimize the operational work, especially when coming from more of a product-oriented team. But it is important to develop a good process to support the recurring operational labor.<p>Finally, within that team of 12 I’d identify the most senior devs and tech leads and work with them closely on these process changes. Try to understand their problems and ensure they understand the goals coming down from the VPs etc.<p>Sorry if this is basic / already in play. Good luck!
> The total fleet in use is around 6000 baremetal nodes and 1000 VMs in AWS spread over various DCs and regions. There are over 1500 developers actively using the Kubernetes clusters every day for a total of 2500 applications running in production.<p>Seems like they are quite successful already.<p>> The struggle I'm having is "how to drive focus" and avoid to die of operations. The team is large making the Scrum process ineffective.<p>Drive focus to where? As stated above, the platform team seems to have achieved what every platform team out there is struggling to achieve. What does "die of operations" mean?<p>And above all, who cares about Scrum?
For splitting teams with interconnected and related work, I do something like <a href="https://agilesquads.org/" rel="nofollow">https://agilesquads.org/</a>. For each sprint, we have a planning cadence where we scope 2 weeks of work, e.g. Feature X and Feature Y. Each squad gets assigned a feature and is able to focus on delivering it. The benefit of this over a real "split" is that you can have both "teams" working on the same roadmap/project/feature and/or change how many engineers are working towards a single feature. When you include "squad leads", it's also great for career development and leadership.<p>On product work, we do an on-call rotation of 1 engineer per week who triages issues and handles prod outages. This may address your 'help desk' type work.
Promote some of the best engineers in the group to staff engineers (or try to recruit them, but that could be very costly) and let them work on overarching issues in cooperation with your existing devops and engineering teams.<p>To grasp staff engineering as a manager (or a potential staff engineer), this is probably the best (and only?) resource around: <a href="https://staffeng.com/" rel="nofollow">https://staffeng.com/</a>