A Kubernetes/GKE mistake that cost me thousands of dollars

155 points by dankohn1 over 5 years ago

24 comments

shrumm over 5 years ago
The problem with learning by doing is that it's extremely hard to find good tutorials designed for production. Most of what I find these days is 'hello world'-y, and then you need some tool like Sentry to catch edge cases that don't get caught in your limited testing.

I've 'rebuilt' our kubernetes cluster almost 3 times since I started by applying lessons learned from running the last iteration for a few months. It's just like anything else in software development: as you start, your tech debt is high, mostly due to inexperience. Force yourself to reduce that debt whenever you can.

As an example: the first version had a bunch of N1's (1 vCPU machines) with hand-written yaml files and no auto scaling. I had to migrate our database and had a headache updating the DB connection string on each deployment. Then I discovered external services, which let me define the DB hostname once (https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-mapping-external-services).

It's just to say that with kubernetes, I think it's impossible to approach it expecting to get it right the first time. Just dedicate more time to monitoring at the beginning so you don't do anything 'too stupid', and take the time to bake what you learn into your cluster.
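The "external services" pattern linked above is typically done with an ExternalName Service. A minimal sketch of that variant, with a hypothetical name and database hostname (not taken from the commenter's setup):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-database              # hypothetical; pods connect to "my-database"
    spec:
      type: ExternalName
      externalName: db.example.internal   # the one place the real DB hostname lives

Deployments then point at the Service name, so a database move only touches this manifest rather than every connection string.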
kingbirdy over 5 years ago
It would be helpful if the author specified what the price difference actually was here at the end of the day - their initial cost was $3,500/mo, but they don't mention how much they pay now after changing instance types.
le_didil over 5 years ago
I would advise reading this section of the GKE docs. It explains the marginal gains in allocatable memory and CPU from running larger nodes: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu

For memory resources, GKE reserves the following:

* 255 MiB of memory for machines with less than 1 GB of memory
* 25% of the first 4 GB of memory
* 20% of the next 4 GB of memory (up to 8 GB)
* 10% of the next 8 GB of memory (up to 16 GB)
* 6% of the next 112 GB of memory (up to 128 GB)
* 2% of any memory above 128 GB

For CPU resources, GKE reserves the following:

* 6% of the first core
* 1% of the next core (up to 2 cores)
* 0.5% of the next 2 cores (up to 4 cores)
* 0.25% of any cores above 4 cores
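A rough worked example of the quoted memory formula (ignoring eviction thresholds and other per-node overhead): on a 3.75 GB single-core node, GKE reserves about 25% of 3.75 GB, roughly 0.94 GB, i.e. a quarter of the machine. On a 15 GB node it reserves 25% of the first 4 GB (1 GB) + 20% of the next 4 GB (0.8 GB) + 10% of the remaining 7 GB (0.7 GB), about 2.5 GB, which is only around 17%. Larger nodes leave proportionally more memory allocatable to pods.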
markbnj over 5 years ago
I think the main issue this author ran into was caused by setting the CPU requests so far below the limits. I get that he was trying to communicate the spiky nature of the workload to the control plane, but I think it would have been better to reserve CPU to cover the spikes by setting the requests higher, especially on a 1-core node.

It's important to grok the underlying systems here, imo. CPU requests map to the cpushares property of the container's cpu,cpuacct cgroup. A key thing about cpushares is that it guarantees a minimum number of 1/1024 shares of a core, but doesn't prevent a process from consuming more if they're available. The CPU limit uses the CPU bandwidth control scheduler, which specifies a number of slices per second (default 100k) and a share of those slices which the process cannot, afaik, exceed. So by setting the request to 20m and the limit to 200m, the author left a lot of room for pods that look like they fit fine under normal operating conditions to spike up and consume all the CPU resources on the machine. K8s is supposed to reserve resources for the kubelet and other components on a per-node basis, but I'm not surprised it's possible to place these components under pressure using settings like these on a 1-core node.
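For reference, a sketch of the kind of spec described here. The 20m/200m figures are the ones quoted from the article, not the author's actual manifest, and the cgroup values in the comments are the approximate translations Kubernetes applies (cgroup v1):

    resources:
      requests:
        cpu: 20m     # -> cpu.shares ~= 20/1000 * 1024 ~= 20; a guaranteed floor, not a cap
      limits:
        cpu: 200m    # -> CFS quota ~= 20,000us per 100,000us period; a hard cap

With a 10x gap between request and limit, the scheduler packs pods as if each needs 20m while each is allowed to burst to 200m, which is exactly the oversubscription the comment is pointing at.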
Thaxll over 5 years ago
It's missing something crucial that isn't mentioned: with 100 nodes vs. 1, you have the overhead of running Kubernetes on all 100 of those nodes, which is actually high (kubelets etc.). On one node you just have one core used by kube, and the rest is available for your app.
rcarmo over 5 years ago
I've seen this kind of thing happen a number of times, and it's good to remind ourselves that oversubscribing resources is still a good way to tackle the "padding" related to scaling.

I have been playing with an autoscaling k3s cluster (https://github.com/rcarmo/azure-k3s-cluster) in order to figure out the right way to scale up compute nodes depending on pod requirements, and even though the Python autoscaler I am noodling with is just a toy, I'm coming to the conclusion that all the work involved in using the Kubernetes APIs to figure out pod requirements and deciding whether to spawn a new VM based on that is barely more efficient than just using 25% "padding" in CPU metrics to trigger autoscaling with standard Azure automation, at least for batch workloads (I run Blender jobs on my toy cluster to keep things simple).

YMMV, but it's fun to reminisce that oversubscription was _the_ way we dealt with running multiple services on VMware stacks, since it was very rare to have everything need all the RAM or IO at once.
jarfil over 5 years ago
That's less of a K8s issue and more of a general multiprocessing issue. Would you rather have:

* 96x single-core CPUs with no multithreading
* 1x 96-core CPU with multithreading, but running all cores at full power all the time
* 1x 96-core CPU that can turn off sets of 16 cores at a time when they're not in use
epiphone over 5 years ago
Here's a whole collection of Kubernetes bloopers: https://github.com/hjacobs/kubernetes-failure-stories. I for one am glad people are sharing!
rainyMammoth over 5 years ago
I read the whole thing and still couldn't tell what the "mistake that cost thousands" actually was.
asdfasgasdgasdg over 5 years ago
This is all very interesting, but one thing that occurs to me is: why are there so many idle pods? Is there any way to fold the work that is currently being done in multiple different pods into one pod? Perhaps via multiple threads, or even just plain single-threaded concurrency? Unless there is some special tenancy requirement, that might be the most efficient way to deal with this situation.
tdurden over 5 years ago
It is difficult to determine what exactly the 'mistake' was in this post.
jackcodes over 5 years ago
One thing that I've been unable to wrap my head around is how to effectively calculate the right CPU share values for single-threaded web servers.

I've got a project using this setup, but it's a fairly common one - e.g. Express with node clustering, Puma on Rails etc. On Kubernetes you obviously just forgo the clustering ability and let k8s handle the concurrency and routing for you.

So in this instance, I'm struggling to see why I wouldn't request a value of 1 vCPU for each process. My thinking is that my program is already single-threaded, and asking the kubernetes CPU scheduler to spread resources between multiple single-threaded processes is pure overhead. At that point I should allow each process to run to its full capacity. Is that correct?

This I feel gets a lot more complex now that my chosen language, DB drivers, and web framework are just starting to support multithreading. That's a can of worms I can't begin to figure out how to allocate for - 2 vCPUs each? Does anyone know?
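For concreteness, the option being weighed here would look roughly like the sketch below (hypothetical values, not advice from the thread; whether to also set a CPU limit is exactly the trade-off debated elsewhere in these comments):

    resources:
      requests:
        cpu: "1"          # reserve a whole core per single-threaded worker process
        memory: 256Mi     # hypothetical; size to the observed process footprint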
specialp over 5 years ago
K8s nodes are going to work much better with more CPUs (to a certain extent). As the post said, when you have idle and bursting pods, you need headroom. If you have single-CPU nodes, your bursting pods are going to oversubscribe the node more often, because the pool size is smaller. If you had 3 pods per CPU and on average 1 out of 3 bursts during any time period, there's a chance 2 or 3 burst at once and cause pods to be evicted and moved. But if they were on 16-CPU nodes it averages out more. Also, single-CPU nodes still need to run their network layer, kubelet etc.

As for the 96-CPU instance, that really isn't good either unless your pods were all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPU. There's a pod limit cap of ~110 per node, not to mention the loss of redundancy. I find 16-32 CPU nodes the best balance.
lazyant over 5 years ago
This is more of a procedural mistake than a specific technical (Kubernetes/GKE) mistake, even if the tech stack is the root cause.

This is a capacity planning or "right-sizing" problem. In prod you just don't go and completely flip your layout (100 1-vCPU servers vs. 1 100-vCPU server or whatever), and even more so in a stack you are not yet an expert in; you change a bit and then measure. Actually, you try to figure this out first in a dev environment if possible.
jayd16 over 5 years ago
I'm really struggling with connecting his conclusion to what we know of his workload. Can someone spell it out for me?

He has many idle pods with a bursty workload.

The author says they need to reserve a lot of CPU or containers fail to create. Why is this? Wouldn't memory be a more likely cause of the failure? How does lack of CPU cause a failure?

Later the author notes that a many-core machine is good for his workload because "pods can be more tightly packed." How does that follow? A pod using more than its reserved resources will bump up against the other pods on that physical machine whether you've virtualized it as a standard-1 or a standard-16. Is there a cost savings because the unreserved RAM is over-provisioned? Wouldn't that overbooking be dangerous if you had uniform load across all the pods in a standard-16?

Said another way, why is resource contention with yourself in a standard-16 better or cheaper than with others in the standard-1 pool?

My understanding was that going with the small vCPU options was simply a choice between pricing granularity and the CPU overhead of k8s.
cpitman over 5 years ago
Choosing the right size for nodes comes up often enough that I blogged some rough guidelines last year: http://cpitman.github.io/openshift/2018/03/21/sizing-openshift-and-kubernetes-nodes.html
abiro over 5 years ago
This article makes such a strong case for ditching k8s for serverless:

- needs granular scaling
- devops expertise is not core to the business
- save developer time
pythonwutang over 5 years ago
> Therefore, the best thing to do is to deploy your program without resource limits, observe how your program behaves during idle/regular and peak loads, and set requested/limit resources based on the observed values.

This is one of the author's fatal assumptions. The best practice as I understand it is to set CPU requests to be around 80% of peak and limits to 120% of peak *before* deploying to prod.

They set themselves up for disaster with this architecture where they have many idle pods polling for resource availability. This resource monitoring should have been delegated to a single pod.

Also, it's really unclear what specific strategy led to the extra costs of 1000s of dollars...
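As a hypothetical illustration of that 80%/120% rule of thumb (the 500m peak is an assumed load-test observation, not a number from the thread):

    resources:
      requests:
        cpu: 400m    # ~80% of the observed ~500m peak
      limits:
        cpu: 600m    # ~120% of the observed peak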
dgoog over 5 years ago
What is the psychosis in the k8s community where they feel the need to talk about losing thousands of dollars? It's a recurring theme with this community that they think somehow makes them look cool - wouldn't that be a clear sign that they should not be using k8s to begin with?

This community is ripe for implosion - what a joke.
olalonde over 5 years ago
Does anyone know what "CPU(cores)" means exactly (e.g. 83m)? What's that m unit?
dilyevsky over 5 years ago
Repeat this mantra every morning:

Never set CPU limits, and always set mem request=limit, unless you *really* have a good reason not to.
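Expressed as a pod resources block, that mantra looks roughly like the sketch below (the 250m/512Mi values are hypothetical placeholders):

    resources:
      requests:
        cpu: 250m        # request CPU so the scheduler packs sensibly, but set no cpu limit
        memory: 512Mi
      limits:
        memory: 512Mi    # memory limit == request, so the pod can't be killed by a neighbor's burst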
egdod over 5 years ago
> Pardon the interruption. We see you've read on Medium before - there's a personalized experience waiting just a few clicks away. Ready to make Medium yours?

Why in the world do I need an account to read a glorified blog? It's text data, I should be able to consume it with curl if I'm so inclined.
test100 over 5 years ago
Sorry to hear this.
rvz over 5 years ago
> As a disclaimer, I will add that this is a really stupid mistake and shows my lack of experience managing auto-scaling deployments.

There is a reason why these DevOps certifications exist in the first place, and why it is a huge risk for a company to spend lots of time and money on training to learn a tool as complex as Kubernetes (unless they are preparing for a certification). Perhaps it would be better to hire a consultant skilled in the field rather than using it blind and creating these mistakes later.

When mistakes like this occur and go unnoticed for a long time, they rack up and create unnecessary costs of as much as $10k/month, which, depending on the company's budget, can be very expensive and can make or break a company.

Unless you know what you are doing, don't touch tools you don't understand.