Honestly it sounds like the main problem here is the scheduler. I'm not saying you should run many massive threadpools, but at the end of the day, if you have a latency-sensitive service that isn't being given CPU for seconds at a time, your scheduler isn't suited for a latency-sensitive service.<p>Bursting is <i>good</i>. You are using resources that would otherwise sit idle. It sounds like the scheduler is punishing the task for the scheduler's own mistake. CFS ensures that the job gets N cores <i>on average</i>, when what you actually want is for the scheduler to ensure that the job gets N cores <i>minimum</i>.<p>So while having too many threads lying around puts some unnecessary pressure on the scheduler and wastes memory, I don't think it should be causing these huge latency spikes.
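To make the "on average" vs. "minimum" distinction concrete, here's a rough sketch of the two cgroup v1 knobs involved, assuming a cpu controller mounted at /sys/fs/cgroup/cpu and a group named "svc" (both hypothetical). The quota/period pair is a ceiling enforced by throttling; cpu.shares is a relative weight that only matters under contention, so it acts more like a floor while still allowing bursting:

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	cg := "/sys/fs/cgroup/cpu/svc" // hypothetical cgroup for the service

	write := func(name, val string) {
		if err := os.WriteFile(filepath.Join(cg, name), []byte(val), 0o644); err != nil {
			log.Printf("write %s: %v", name, err)
		}
	}

	// Hard cap: at most 2 CPUs of runtime per 100ms period; once the quota
	// for a period is spent, the group is throttled until the next period.
	// This is the "N cores on average" behaviour that causes the stalls.
	write("cpu.cfs_period_us", "100000")
	write("cpu.cfs_quota_us", "200000")

	// Relative weight: roughly two cores' worth of priority under contention
	// (1024 ~= one core by convention), but the group can still burst onto
	// idle CPUs. This is much closer to "N cores minimum".
	write("cpu.shares", "2048")
}
```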
We have big problems with this at work, in particular an autoscaler that assumes services can use 100% of their CPU allocation. As Dan describes, this isn't true. But the autoscaler is both a cost-saving measure and a "dumb product engineers don't understand capacity planning" measure, so it can't be turned off, only downtimed for a while. For certain services, if we forget to renew the downtime, the downscaling degrades tail latencies enough that an outage is guaranteed. Fun times.
I saw this problem recently at work, with a Go program running on Kubernetes. You can work around it by setting GOMAXPROCS to the same value as the CPU limit in the container spec.<p>(So be careful not to assume this problem is specific to Java, the JVM, Mesos, or Twitter's environment.)
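For what it's worth, a minimal sketch of that workaround, assuming the deployment exposes the container's CPU limit to the process as an environment variable (CPU_LIMIT is a made-up name here; on Kubernetes the downward API can populate it from resources.limits.cpu):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

func main() {
	// CPU_LIMIT is a hypothetical variable name for this sketch; the idea is
	// that the container spec injects its own CPU limit (e.g. via a downward
	// API resourceFieldRef on resources.limits.cpu).
	if v := os.Getenv("CPU_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			runtime.GOMAXPROCS(n)
		}
	}

	// GOMAXPROCS(0) reports the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

There are also libraries that read the cgroup quota directly and do this at startup (go.uber.org/automaxprocs is one), which avoids having to wire the limit through the environment at all.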