Very misleading title, was hoping for a more substantive read. Kubernetes itself wasn't causing latency issues, it was some config in their auth service and AWS environment.

In the takeaways section, the author blames the issue on merging together complicated software systems. While absolutely true, this isn't specific to k8s at all. To specifically call out k8s as the reason for latency spiking is misleading.
"Once this change was applied, requests started being served without involving the AWS Metadata service and returned to an even lower latency than in EC2."<p>Title should be: My configuration made my latency 10x higher.
The problem wasn't in Kubernetes at all! The problem was in KIAM and the AWS Java SDK.

It would be more accurate to criticize AWS's Kubernetes support. Both KIAM and the AWS Java SDK are specific to AWS.
What grinds my gears is hard-coded magic timeout numbers. Somehow microservice people seem to think these are good (e.g. for circuit breakers) without realizing the unexpected consequences of composing them like this. Your timeout is not my timeout. So first: don't do it, time is an awful thing to build behaviour on. And second, if you ignore that, at least make it a config parameter and document it, then I've got a chance of finding it without wire-level debugging.
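As an illustration (not from the article, and all the keys and values here are hypothetical), surfacing those numbers as named, documented configuration rather than constants buried in code might look something like:

    # Hypothetical client configuration: every timeout is a named,
    # documented setting with an explicit unit, not a magic constant.
    httpClient:
      connectTimeoutMs: 1000    # fail fast on unreachable hosts
      readTimeoutMs: 5000       # per-request ceiling; callers may need less
    circuitBreaker:
      failureThreshold: 5       # consecutive failures before opening
      openDurationMs: 30000     # how long to stay open before retrying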
And this is why I find DevOps work more interesting than programming. System integration is just endlessly challenging. I always enjoy reading a well-documented integration debugging session!
This issue has been hit by quite a number of people in the last year [1]

AWS have recently added the feature natively [2], so once that version of the SDK is in use by all your pods you won't need Kiam any more.

[1] https://github.com/uswitch/kiam/issues/191
[2] https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/
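For anyone curious what the native approach from [2] looks like in practice, here is a rough sketch of the service-account side (the names and the role ARN are just placeholders):

    # Sketch only: annotate a ServiceAccount with an IAM role ARN; pods that
    # use it get credentials via a projected web identity token instead of
    # going through the instance metadata path.
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: my-service
      namespace: default
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-service-role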
We have hit similar issues with GKE. GKE has a soon-to-be-deprecated feature called "metadata concealment" [1], which runs a proxy [2] that intercepts the GCE metadata calls. Some of Google's own libraries made metadata requests at such a high rate that the proxy would lock up and not service any requests. New pods couldn't start on nodes with locked-up metadata proxies, because those same libraries that overloaded the proxy would hang if metadata wasn't available.

That was compounded by the metadata requests using DNS and the metadata IP, and until recently Kubernetes didn't have any built-in local DNS cache [3] (GKE still doesn't), which in turn overloaded kube-dns, making other DNS requests fail.

We worked around the issues by disabling metadata concealment, and added metadata to /etc/hosts using pod hostAliases:

    hostAliases:
      - ip: "169.254.169.254"
        hostnames:
          - "metadata.google.internal"
          - "metadata"

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment

[2] https://github.com/GoogleCloudPlatform/k8s-metadata-proxy

[3] https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
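In case it saves someone a lookup: hostAliases sits under the pod spec, so in a Deployment it goes inside the pod template. A minimal sketch, with the Deployment name, labels, and image being illustrative placeholders:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example
      template:
        metadata:
          labels:
            app: example
        spec:
          # Map the metadata hostnames straight to the link-local IP so
          # lookups bypass kube-dns and the (disabled) metadata proxy.
          hostAliases:
            - ip: "169.254.169.254"
              hostnames:
                - "metadata.google.internal"
                - "metadata"
          containers:
            - name: app
              image: gcr.io/example-project/app:latest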
"We are blending complex systems that had never interacted together before with the expectation that they collaborate forming a single, larger system."<p>I suggest more people read the first few chapters of <i>Specifying Systems</i> by Lamport. Maybe the rest is good also, but that's as far as I got.<p>It works through a trivial system (display clock) and combines it with another trivial system (display weather).<p>Nothing Earth-shattering, but it really stuck with me. Thinking about it at that level gave me a new appreciation for what combining two systems <i>means</i>.
Agreed about the poor title, but:

> DNS resolution is indeed a bit slower in our containers (the explanation is interesting, I will leave that for another post).

I would like to see this expanded upon, or to hear whether anyone else has run into something similar.
Isn't this a KIAM bug? The default configuration of any piece of software should not cause pathological cases in other pieces of software *that are commonly used with it.* Maybe I'm just a bleeding heart, but I think good software delights its users; the deployment and configuration story is a part of this.
This is good info, but the title is misleading. It could just as easily have been "my latency increased because my local time was off" or something like that.
This had nothing to do with Kubernetes... it was a problem with their setup.
> Kubernetes made my latency 10x higher

The title is a bit misleading: Kubernetes didn't cause the 10x latency, and latency was actually lower after they fixed their issues.

TL;DR: they migrated from EC2 to Kubernetes; due to some default settings in Kiam and the AWS Java SDK, application latency increased; after reconfiguration, latency on Kubernetes was lower than on EC2.