Very misleading title, was hoping for a more substantive read. Kubernetes itself wasn't causing latency issues, it was some config in their auth service and AWS environment.

In the takeaways section, the author blames the issue on merging together complicated software systems. While absolutely true, this isn't specific to k8s at all. To specifically call out k8s as the reason for latency spiking is misleading.
"Once this change was applied, requests started being served without involving the AWS Metadata service and returned to an even lower latency than in EC2."<p>Title should be: My configuration made my latency 10x higher.
The problem wasn't in Kubernetes at all! The problem was in KIAM and the AWS Java SDK.

It would be more accurate to criticize AWS's Kubernetes support. Both KIAM and the AWS Java SDK are specific to AWS.
What grinds my gears is hard-coded magic timeout numbers. Somehow microservice people seem to think these are good (e.g. for circuit breakers) without realizing the unexpected consequences of composing them like this. Your timeout is not my timeout. So first: don't do it, time is an awful thing to build behaviour on. And second, if you ignore that, at least make it a config parameter and document it, then I've got a chance of finding it without wire-level debugging.
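As an illustration (not from the article, and all the keys and values here are hypothetical), surfacing those numbers as named, documented configuration rather than constants buried in code might look something like:

    # Hypothetical client configuration: every timeout is a named,
    # documented setting with an explicit unit, not a magic constant.
    httpClient:
      connectTimeoutMs: 1000    # fail fast on unreachable hosts
      readTimeoutMs: 5000       # per-request ceiling; callers may need less
    circuitBreaker:
      failureThreshold: 5       # consecutive failures before opening
      openDurationMs: 30000     # how long to stay open before retrying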
And this is why I find DevOps work more interesting than programming. System integration is just endlessly challenging. I always enjoy reading a well-documented integration debugging session!
This issue has been hit by quite a number of people in the last year [1]

AWS have recently added the feature natively [2], so once that version of the SDK is in use by all your pods you won't need Kiam any more.

[1] https://github.com/uswitch/kiam/issues/191
[2] https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
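For anyone curious what the native approach from [2] looks like in practice, here is a rough sketch of the service-account side (the names and the role ARN are just placeholders):

    # Sketch only: annotate a ServiceAccount with an IAM role ARN; pods that
    # use it get credentials via a projected web identity token instead of
    # going through the instance metadata path.
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-service
      namespace: default
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-service-role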
We have hit similar issues with GKE. GKE has a soon-to-be-deprecated feature called "metadata concealment" [1], which runs a proxy [2] that intercepts the GCE metadata calls. Some of Google's own libraries made metadata requests at such a high rate that the proxy would lock up and not service any requests. New pods couldn't start on nodes with locked-up metadata proxies, because those same libraries that overloaded the proxy would hang if metadata wasn't available.

That was compounded by the metadata requests using DNS and the metadata IP, and until recently Kubernetes didn't have any built-in local DNS cache [3] (GKE still doesn't), which in turn overloaded kube-dns, making other DNS requests fail.

We worked around the issues by disabling metadata concealment, and added metadata to /etc/hosts using pod hostAliases:

    hostAliases:
      - ip: "169.254.169.254"
        hostnames:
          - "metadata.google.internal"
          - "metadata"

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment

[2] https://github.com/GoogleCloudPlatform/k8s-metadata-proxy

[3] https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
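In case it saves someone a lookup: hostAliases sits under the pod spec, so in a Deployment it goes inside the pod template. A minimal sketch, with the Deployment name, labels, and image being illustrative placeholders:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example
      template:
        metadata:
          labels:
            app: example
        spec:
          # Map the metadata hostnames straight to the link-local IP so
          # lookups bypass kube-dns and the (disabled) metadata proxy.
          hostAliases:
            - ip: "169.254.169.254"
              hostnames:
                - "metadata.google.internal"
                - "metadata"
          containers:
            - name: app
              image: gcr.io/example-project/app:latest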
"We are blending complex systems that had never interacted together before with the expectation that they collaborate forming a single, larger system."<p>I suggest more people read the first few chapters of <i>Specifying Systems</i> by Lamport. Maybe the rest is good also, but that's as far as I got.<p>It works through a trivial system (display clock) and combines it with another trivial system (display weather).<p>Nothing Earth-shattering, but it really stuck with me. Thinking about it at that level gave me a new appreciation for what combining two systems <i>means</i>.
Agreed about the poor title, but:

> DNS resolution is indeed a bit slower in our containers (the explanation is interesting, I will leave that for another post).

I would like to see this expanded upon, or to hear whether anyone else has run into something similar.
Isn't this a KIAM bug? The default configuration of any piece of software should not cause pathological cases in other pieces of software *that are commonly used with it.* Maybe I'm just a bleeding heart, but I think good software delights its users; the deployment and configuration story is a part of this.
This is good info, but the title is misleading. It could just as easily have been "my latency increased because my local time was off" or something like that.
This had nothing to do with Kubernetes... it was a problem with their setup.
> Kubernetes made my latency 10x higher

The title is a bit misleading: Kubernetes didn't cause the 10x latency, and latency was actually lower after they fixed their issues.

TL;DR: they migrated from EC2 to Kubernetes; due to some default settings in Kiam and the AWS Java SDK, application latency increased; after reconfiguration, latency on Kubernetes was lower than on EC2.