Honestly it sounds like the main problem here is the scheduler. I'm not saying you should run many massive threadpools, but at the end of the day, if you have a latency-sensitive service that isn't being given CPU for seconds at a time, your scheduler isn't suited for a latency-sensitive service.<p>Bursting is <i>good</i>. You are using resources that would otherwise sit idle. It sounds like the scheduler is punishing the task for the scheduler's own mistake. CFS ensures that the job gets N cores <i>on average</i>, when what you actually want is for the scheduler to ensure that the job gets N cores <i>minimum</i>.<p>So while having too many threads lying around puts some unnecessary pressure on the scheduler and wastes memory, I don't think it should be causing these huge latency spikes.
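To make the "on average" vs. "minimum" distinction concrete, here's a rough sketch of the two cgroup v1 knobs involved, assuming a cpu controller mounted at /sys/fs/cgroup/cpu and a group named "svc" (both hypothetical). The quota/period pair is a ceiling enforced by throttling; cpu.shares is a relative weight that only matters under contention, so it acts more like a floor while still allowing bursting:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/cpu/svc" // hypothetical cgroup for the service

	write := func(name, val string) {
		if err := os.WriteFile(filepath.Join(cg, name), []byte(val), 0o644); err != nil {
			log.Printf("write %s: %v", name, err)
		}
	}

	// Hard cap: at most 2 CPUs of runtime per 100ms period; once the quota
	// for a period is spent, the group is throttled until the next period.
	// This is the "N cores on average" behaviour that causes the stalls.
	write("cpu.cfs_period_us", "100000")
	write("cpu.cfs_quota_us", "200000")

	// Relative weight: roughly two cores' worth of priority under contention
	// (1024 ~= one core by convention), but the group can still burst onto
	// idle CPUs. This is much closer to "N cores minimum".
	write("cpu.shares", "2048")
}
```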
We have big problems with this at work, in particular an autoscaler that assumes services can use 100% of their CPU allocation. As Dan describes, this isn't true. But the autoscaler is both a cost-saving measure and a "dumb product engineers don't understand capacity planning" measure, so it can't be turned off, only downtimed for a while. For certain services, if we forget to renew the downtime, the downscaling degrades tail latencies enough that an outage is guaranteed. Fun times.
I saw this problem recently at work, with a Go program running on Kubernetes. You can work around it by setting GOMAXPROCS to the same value as the CPU limit in the container spec.<p>(So be careful not to assume this problem is specific to Java, the JVM, Mesos, or Twitter's environment.)
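For what it's worth, a minimal sketch of that workaround, assuming the deployment exposes the container's CPU limit to the process as an environment variable (CPU_LIMIT is a made-up name here; on Kubernetes the downward API can populate it from resources.limits.cpu):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

func main() {
	// CPU_LIMIT is a hypothetical variable name for this sketch; the idea is
	// that the container spec injects its own CPU limit (e.g. via a downward
	// API resourceFieldRef on resources.limits.cpu).
	if v := os.Getenv("CPU_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			runtime.GOMAXPROCS(n)
		}
	}

	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

There are also libraries that read the cgroup quota directly and do this at startup (go.uber.org/automaxprocs is one), which avoids having to wire the limit through the environment at all.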