Nice. Thanks for sharing lessons learned and tools that encode this knowledge.

I work on container cluster management full time myself. I focus on AWS ECS, so the problems are technically very different but conceptually very similar.

The question is: who watches the watchmen?

A container scheduler is supposed to be responsible for maintaining the health of the entire cluster. But if it runs into fundamental trouble doing so, how do you automatically detect that and get it back into a working state?

On ECS I run an agent container on every instance that terminates the instance on observed failures (rough sketch below). The most common problems I have observed are a bad disk (full, read-only, or too slow) and a locked-up Docker daemon.

I also schedule one more monitor process in the cluster that periodically polls the ECS, EC2, and ASG APIs. A common failure mode is instances that lose ECS agent connectivity and need to be terminated (second sketch below).

All this hard-won knowledge is encoded in the open-source Convox platform: https://github.com/convox/rack

The next problem is that sometimes this monitor container stops working due to the very problems it's trying to correct! I plan to move it to a Lambda function to remove the correlated failure.

But I always wonder: why aren't these problems handled natively by Amazon and ECS?

The same question applies to this post. If you have to run additional monitoring to make Kubernetes work reliably over the long term, can we consider that a Kubernetes bug?
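To make the agent idea concrete, here's a minimal sketch in Go of the kind of node-level check I mean, assuming termination happens via the EC2 API. The probe path, timeouts, and check interval are illustrative, not Convox's actual implementation:

```go
// Minimal sketch of a node-level health agent: probe the disk and the
// Docker daemon, and terminate this instance if either looks wedged.
// Paths, thresholds, and the terminate-on-failure policy are assumptions.
package main

import (
	"context"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"os/exec"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// diskHealthy writes and removes a small probe file: a full or read-only
// filesystem fails the write, and a slow disk blows the deadline.
func diskHealthy(path string, limit time.Duration) bool {
	done := make(chan error, 1)
	go func() {
		err := ioutil.WriteFile(path, []byte("probe"), 0644)
		os.Remove(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err == nil
	case <-time.After(limit):
		return false // disk too slow
	}
}

// dockerHealthy checks that the Docker daemon answers a trivial command
// before the deadline; a locked-up daemon hangs and trips the timeout.
func dockerHealthy(limit time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return exec.CommandContext(ctx, "docker", "ps", "-q").Run() == nil
}

// terminateSelf looks up this instance's ID from EC2 instance metadata
// and terminates it; the ASG then replaces it with a fresh instance.
func terminateSelf() error {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/instance-id")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	id, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	svc := ec2.New(session.Must(session.NewSession()))
	_, err = svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: []*string{aws.String(string(id))},
	})
	return err
}

func main() {
	for range time.Tick(30 * time.Second) {
		if !diskHealthy("/var/tmp/health-probe", 5*time.Second) || !dockerHealthy(10*time.Second) {
			log.Println("node unhealthy, terminating instance")
			if err := terminateSelf(); err != nil {
				log.Printf("terminate failed: %v", err)
			}
		}
	}
}
```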
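And a similarly hedged sketch of the connectivity monitor, using the aws-sdk-go ECS and EC2 clients. The cluster name is a placeholder, pagination is omitted, and terminating immediately on a single disconnected reading is a simplification:

```go
// Minimal sketch of the cluster monitor: find ECS container instances
// whose agent has lost connectivity and terminate them so the ASG can
// bring up replacements that register cleanly.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	ecsSvc := ecs.New(sess)
	ec2Svc := ec2.New(sess)

	cluster := aws.String("my-cluster") // placeholder cluster name

	// List all container instances registered to the cluster.
	list, err := ecsSvc.ListContainerInstances(&ecs.ListContainerInstancesInput{
		Cluster: cluster,
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(list.ContainerInstanceArns) == 0 {
		return
	}

	// Describe them to get the agent connectivity status.
	desc, err := ecsSvc.DescribeContainerInstances(&ecs.DescribeContainerInstancesInput{
		Cluster:            cluster,
		ContainerInstances: list.ContainerInstanceArns,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Terminate any instance whose ECS agent has disconnected.
	for _, ci := range desc.ContainerInstances {
		if ci.AgentConnected != nil && !*ci.AgentConnected {
			log.Printf("agent disconnected on %s, terminating", *ci.Ec2InstanceId)
			_, err := ec2Svc.TerminateInstances(&ec2.TerminateInstancesInput{
				InstanceIds: []*string{ci.Ec2InstanceId},
			})
			if err != nil {
				log.Printf("terminate failed: %v", err)
			}
		}
	}
}
```

In practice you'd also want a cooldown or a "seen disconnected N times" threshold so a transient blip doesn't cycle half the cluster at once.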