I wouldn't trust the management of this team with anything. They appear totally incompetent in both management and basic analytical skills. Who in the heck creates a cluster per service per cloud provider, duplicates all the supporting services around it, burns money and sanity in a pit, and then blames the tool?

Literally every single decision they listed was to use each of the given tools in the absolute worst, most incompetent way possible. I wouldn't trust them with a Lego set with this record.

The people who quit didn't quit merely out of burnout. They quit the stupidity of the managers running this s##tshow.
Why exactly did they have 47 clusters? One thing I've noticed (maybe because I'm not at that scale) is that companies are running 1+ clusters per application. Isn't the point of Kubernetes that you can run your entire infra in a single cluster, at most adding a second cluster for redundancy, while spreading nodes across regions and AZs and even clouds?

I think the bottleneck is networking and how much crosstalk your control nodes can take, but that's your architecture team's job, no?
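For the "one cluster, spread across zones" point, a minimal sketch of what that looks like in practice. The Deployment name and image are made up, but topologySpreadConstraints and the zone label are stock Kubernetes:

```yaml
# Hypothetical service spread evenly across availability zones
# within a single cluster, instead of one cluster per service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server            # made-up service name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # insist on an even AZ spread
          labelSelector:
            matchLabels:
              app: api-server
      containers:
        - name: api-server
          image: example/api:1.0             # placeholder image
```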
If you have 200 YAML files for a *single service* and 47 clusters, I think you're using k8s wrong. And 5 + 3 different monitoring and logging tools could be a symptom of chaos in the organization.

k8s, the Go runtime, and the network stack have been heavily optimized by armies of engineers at Google and big tech, so I am very suspicious of these claims without evidence. Show me the resource usage from k8s component overhead, and the 15-minute vs. 3-minute deploys, and then I'll believe you. And the 200-file YAML or Helm charts, so I can understand why in God's name you're doing it that way.

This post just needs a lot more details. What are the typical services/workloads running on k8s? What's the end-user application?

I taught myself k8s in the first month of my first job, and it felt like having superpowers. The core concepts are very beautiful, like processes on Linux or JSON APIs over HTTP. And it's not too hard to build a CustomResourceDefinition or dive into the various high-performance disk and network IO components if you need to.

I do hate Helm to some degree, but there are alternatives like Kustomize (minimal sketch below), Jsonnet/Tanka, or Cue: https://github.com/cue-labs/cue-by-example/tree/main/003_kubernetes_tutorial#controlling-kubernetes-with-cue. You can even manage k8s resources via Terraform or Pulumi.
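Since the 200-file complaint keeps coming up: a minimal Kustomize sketch of how a shared base plus per-environment overlays keeps that number down. Directory names and the patch target are hypothetical:

```yaml
# overlays/prod/kustomization.yaml, layered on a shared base.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # one canonical Deployment/Service per app lives here
patches:
  # prod only overrides what differs from the base, e.g. replica count
  - target:
      kind: Deployment
      name: api-server  # made-up name
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 6
```

One base plus a handful of small overlay patches per environment, instead of 200 near-duplicate manifests.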
How to save $1M on your cloud infra? Start from a $2M bill.

That's how I see most of these projects. You create a massively expensive infra because webscale, then 3 years down the road you (or someone else) get to rebuild it 10x cheaper. You get to write two blog posts, one for adopting $tech and one for migrating off $tech. A line in the CV and a promotion.

But kudos to them for managing to stop the snowball and actually reverting course. Most places wouldn't dare because of sunk costs.
How do you end up with 200-YAML-file “basic deployments” without anyone looking up from their keyboard and muttering “guys, what are we doing”?

Honestly, they could have picked any stack as the next one, because the key win here was starting from scratch.
So they made bad architecture decisions, blamed it on Kubernetes for some reason, and then decided to rebuild everything from scratch. Solid. The takeaway being what? Don't make bad decisions?
Like most tech stories, this had pretty much nothing to do with the tool itself and everything to do with the people/organization. The entire article can be summarized with this one quote:

> In short, organizational decisions and an overly cautious approach to resource isolation led to an unsustainable number of clusters.

And while I empathize with how they could end up in this situation, it feels like a lot of words were spent blaming the tool choice instead of being a cautionary tale about, for example, planning and communication.
So many astonishing things were done...

> As the number of microservices grew, each service often got its own dedicated cluster.

Wow. Just wow.
Are the people who decided to spin up a separate kubernetes cluster for each microservice still employed at your organization? If so, I don't have high hopes for your new solution either.
I feel like OP would've been much better off if they had just reworked their cluster setup into something sensible instead of abandoning K8s completely.

I've worked on both ECS and K8s, and K8s is much better. All of the problems they listed were poor design decisions, not K8s limitations.

- 47 clusters: This is insane. They acknowledge it in the post, but they could've reworked this.

- Multi-cloud: It's no longer possible now that they're on ECS, and they could've gotten the same reduction in complexity with single-cloud k8s.
We have 3 clusters, prod, dev, test, with a few pods each.

Each cluster is wasting tons of CPU and I/O bandwidth just to sit idle. I was told that it is etcd doing thousands of I/O operations per second and that this is normal.

For a few monoliths.
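If anyone wants to put a number on that on their own clusters, a hedged sketch of a Prometheus rule watching etcd's WAL fsync rate. This assumes the kube-prometheus-stack CRDs and that your etcd exposes metrics; the threshold and names are made up:

```yaml
# Hypothetical PrometheusRule: flag an "idle" cluster whose etcd
# is still churning disk via constant WAL fsyncs.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-idle-churn
  namespace: monitoring
spec:
  groups:
    - name: etcd.io
      rules:
        - alert: EtcdHighWalFsyncRate
          # etcd_disk_wal_fsync_duration_seconds is a standard etcd histogram
          expr: rate(etcd_disk_wal_fsync_duration_seconds_count[5m]) > 100
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "etcd fsyncing >100 times/s on a mostly idle cluster"
```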
47 clusters? Is that per developer? You could manage a small, disposable VPS for every developer/environment, etc., and have a Kubernetes cluster only for the production environment...
Too bad the author and company are anonymous. I'd like to confirm my assumption that the author has zero business using k8s at all.

Infrastructure is a lost art. Nobody knows what they're doing. We've entered an evolutionary spandrel where "more tools = better", meaning the candidate for an IT role who swears by 10 k8s tools always beats the one who could actually fix your infra, and who would also remove k8s because it's not helping you at all.
The leaps in this writing pain me. There are other aspects, but they've been mentioned enough.

Vendor lock-in does not come about by relying on only one cloud, but by adopting non-standard technology and interfaces. I do agree that running on multiple providers is the best way of checking whether there is lock-in.

Lowering the level of sharing further by running per-service and per-stage clusters, as mentioned in the piece, was likewise at best an uninformed decision.

Naturally, moving to AWS and letting dedicated teams handle workload orchestration at much higher scale will yield better efficiencies. Ideally without giving up vendor-agnostic deployments, by continuing the use of IaC.
Sensible. Kubernetes is an anti-pattern, along with containerized production applications in general.

- replicates OS services poorly

- the OS is likely already running on a hypervisor divvying up hardware resources into VPSes

- wastes RAM and CPU cycles

- forces kubectl onto everything

- destroys the integrity of basic kernel networking principles

- takes advantage of developer ignorance of the OS and reinforces it

I get it, it's a handy hack for non-production services or one-off installs, but then it's basically just a glorified VM.
> $25,000/month just for control planes

To get to this point, someone must have fucked up way earlier by not knowing what they were doing. Don't do k8s, kids!
I looked into K8s some years back and found so many new concepts that I thought: is our team big enough for this much "new"?

Then I read someone saying that K8s should never be used by teams of <20 FTE, and that it requires 3 people learning it for redundancy (when it's used to self-host a SaaS product). This seemed like really good advice.

Our team is smaller than 20 FTE, so we use AWS/Fargate now. Works like a charm.
What else is out there? I'm running Docker Swarm and it's extremely hard to make it work with IPv6. I'm running my software on a 1GB RAM cloud instance for 4EUR/month, and k8s alone wants at least 1GB of RAM.

As of now, it seems like my only alternative is to run k8s on a 2GB RAM system, so I'm considering moving to Hetzner just to run k3s or k0s.
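For what it's worth, k3s can be slimmed down further on a low-RAM box by disabling its bundled components. A sketch of a server config, assuming current k3s option names (check them against your version):

```yaml
# /etc/rancher/k3s/config.yaml -- keys mirror k3s server CLI flags.
# Drop bundled extras a tiny single-node setup may not need:
disable:
  - traefik          # bundled ingress controller
  - metrics-server
  - servicelb        # bundled load balancer
# Keep the kubelet's own bookkeeping small on a 1-2GB node:
kubelet-arg:
  - "max-pods=20"
```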
I've read this article multiple times now and I'm still not sure if it's good satire, or real and they can burn money like crazy, or some subtle ad for AWS managed cloud services :)
It's the same story over and over again. Nobody gets fired for choosing AWS or Azure. Clueless managers and resume-driven developers: a terrible combination.
The good thing is that this leaves a lot of room for small companies, which can outcompete larger ones just by not making those dumb choices.
Is there a non-paywalled version of this? The title is a little clickbaity, but from the comments here it seems this is a company that jumped on the k8s bandwagon, made a lot of terrible decisions along the way, and is now blaming k8s for everything.
When they pulled apart all those Kubernetes clusters they probably found a single fat computer would run their entire workload.

“Hey, look under all that DevOps cloud infrastructure! There's some business logic! It's been squashed flat by the weight of all the containers and orchestration and serverless functions and observability and IAM.”
> 2 team members quit citing burnout

And I would have gotten away with it too, if only someone would rid me of that turbulent meddling cluster orchestration tooling!
Kubernetes is not a one-size-fits-all solution, but even the bullet points in the article raise a number of questions. I have been working with Kubernetes since 2016 and try to stay pragmatic about tech. We currently support 20+ clusters with a team of 5 people across 2 clouds plus on-prem. If Kubernetes is fine for a given company/project/business use case/architecture, we'll use it. Otherwise we'll consider whatever fits the specific target requirements best.

Smelly points from the article:

- "147 false positive alerts" - alert and monitoring hygiene helps; anything will have a low signal-to-noise ratio if not properly taken care of (see the Alertmanager sketch at the end of this comment). Been there, done that.

- "$25,000/month just for control planes / 47 clusters across 3 cloud providers" - multiple questions here. Why so many clusters? Were they provider-managed (EKS, GKE, AKS, etc.) or self-managed? $500 per control plane per month is too much. A cost breakdown would be great.

- "23 emergency deployments / 4 major outages" - what was the nature of the emergencies and outages? Post-mortem RCA summary? Lessons learnt?

- "40% of our nodes running Kubernetes components" - a potential indicator of a huge number of small worker nodes. Was the cluster autoscaler used? The descheduler? What were those components?

- "3x redundancy for high availability" - depends on your SLO, risk appetite, and budget. It is fine to have 2x across 3 redundancy zones and stay lean on resource and budget usage, and it is not mandatory for *everything* to be highly available 24/7/365.

- "60% of DevOps time spent on maintenance" - https://sre.google/workbook/eliminating-toil/

- "30% increase in on-call incidents" - post-mortems, RCA, lessons learnt? On-call incidents do not increase just because a specific tool or technology is being used.

- "200+ YAML files for basic deployments" - there are multiple ways to organise and optimise configuration management. How was it done in the first place?

- "5 different monitoring tools / 3 separate logging solutions" - should be at most one of each. 3 different cloud providers? So come up with a cloud-agnostic solution.

- "Constant version compatibility issues" - happens when due diligence isn't done properly. Also, the Kubernetes API is fairly stable (existing APIs preserve backwards compatibility) and predictable in terms of deprecating existing APIs.

That being said, glad to know the team has benefited from ditching Kubernetes. Just keep in mind that this "you don't need ${TECHNOLOGY_NAME} and here is why" genre is oftentimes an emotional generalisation of someone's particular experience and cannot be applied as a universal rule.
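On the alert-hygiene point above, a minimal Alertmanager sketch of what "taken care of" means in practice: group related alerts, wait out flapping, suppress downstream noise, and page only on real severity. The receiver names and the ClusterDown alert are made up for illustration:

```yaml
# alertmanager.yml sketch -- routing and inhibition to cut false-positive pages.
route:
  receiver: slack-noise          # default sink: a channel, not a pager
  group_by: [alertname, cluster] # batch alerts that fire together
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [severity="page"]
      receiver: oncall-pager     # only explicitly page-worthy alerts wake anyone
inhibit_rules:
  # if a whole cluster is down, suppress the per-service warnings beneath it
  - source_matchers: [alertname="ClusterDown"]
    target_matchers: [severity="warning"]
    equal: [cluster]
receivers:
  - name: slack-noise
  - name: oncall-pager
```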
> DevOps Team Is Happier Than Ever

Of course they are. The original value proposition of cloud providers managing your infra (and more so with k8s) was that you could fire your ops team (now called "DevOps" because that whole idea didn't pan out) and the developers could manage their services directly.

In any case, your DevOps team has job security now.
The price comparison doesn't make sense if they used to have a multi-cloud system and now it's just AWS. Makes me fear this is just content paid for by AWS. Actually getting multi-cloud to work is a huge achievement, and I would be super interested to hear of another tech standard that would make it easier.

Also: post a paywall mirror?
<a href="https://archive.is/x9tB6" rel="nofollow">https://archive.is/x9tB6</a>
How did your managers ever _ever_ sign off on something that cost an extra $0.5M?

Either you're pre-profit or some other bogus kind of entity, or your company streamlined by moving to k8s and then streamlined further by cutting away the things you didn't need.

I'm frankly just alarmed at the thought of wasting that much revenue; I could bring up a fleet of in-house racks for that money!
I feel like the Medium paywall saved me... as soon as I saw "47 clusters across 3 different cloud providers", I began to think that the tool used here might not actually be the real issue.
> We were managing 47 Kubernetes clusters across three cloud providers.

What a doozy of a start to this article. How do you even reach this point?
Oh boy. Please, please stop using Medium for anything. I have lost count of how many potentially interesting or informative articles are published behind the Medium sign-in wall. At least for me, if you aren't publishing blog articles in public, then what's the point of me trying to read them?
Don't bother reading. This is just another garbage in garbage out kind of article written by something that ends in gpt. Information density approaches zero in this one.