
Monitoring Kubernetes in Production

59 points by twakefield about 9 years ago

4 comments

sciurus about 9 years ago
I don't get this. They say "Monitoring the state of a Kubernetes cluster is not straightforward using traditional monitoring tools." but they didn't try monitoring it using a traditional monitoring tool. They tried monit, which is a process watchdog that's limited to a single host.

In the end it sounds like they created three things:

1) Health checks of a Kubernetes cluster's individual components
2) End-to-end checks of a Kubernetes cluster's functionality
3) A distributed monitoring system for running those checks

I'm pretty sure they could have plugged their checks into e.g. Nagios (which is about as traditional as you can get) and been fine.
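For a sense of what plugging such checks into a traditional tool could look like, here is a minimal Nagios-style plugin sketch that polls the /healthz endpoints of the control-plane components and reports through standard plugin exit codes. The component addresses and ports below are illustrative assumptions (the old insecure local ports of that era), not details taken from the article.

```python
#!/usr/bin/env python3
"""Nagios-style check of Kubernetes control-plane /healthz endpoints.

Sketch only: the component URLs/ports are assumptions and will differ
between clusters; adjust them to match your own deployment.
"""
import sys
import urllib.request

# Hypothetical endpoints -- substitute the addresses of your components.
COMPONENTS = {
    "apiserver": "http://127.0.0.1:8080/healthz",           # insecure local port
    "scheduler": "http://127.0.0.1:10251/healthz",
    "controller-manager": "http://127.0.0.1:10252/healthz",
}

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def healthy(url):
    """A component is healthy if /healthz answers 200 with body 'ok'."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200 and resp.read().decode().strip() == "ok"
    except Exception:
        return False


def main():
    failed = [name for name, url in COMPONENTS.items() if not healthy(url)]
    if failed:
        print("CRITICAL - unhealthy components: " + ", ".join(failed))
        sys.exit(CRITICAL)
    print("OK - all components healthy")
    sys.exit(OK)


if __name__ == "__main__":
    main()
```

A Nagios command definition pointing at a script like this, plus one service check per cluster, would be enough to surface component failures in an entirely traditional setup.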
nzoschke about 9 years ago
Nice. Thanks for sharing lessons learned and tools that encode this knowledge.

I work on container cluster management full time myself. I am focusing on AWS ECS, so the problems are very different technically but very similar conceptually.

The question is: who watches the watchmen?

A container scheduler is supposed to be responsible for maintaining the entire health of the cluster. But if it has fundamental trouble doing so, how do you automatically detect this and get it back into a working state?

On ECS I have an agent container running on every instance that terminates the instance on observed failures. The most common problems I have observed are a bad disk (full, read-only, or too slow) and a locked-up docker daemon.

I also schedule one more monitor process in the cluster that periodically monitors the ECS, EC2 and ASG APIs. A common failure is instances that lose ECS agent connectivity and need to be terminated.

All this hard-won knowledge is encoded in the open source Convox platform: https://github.com/convox/rack

The next problem is that sometimes this monitor container stops working due to the very problems it's trying to correct! I plan to move it to a Lambda task to remove the correlated failure.

But I always wonder: why aren't these problems handled natively by Amazon and ECS?

The same question applies to this post. If you have to run additional monitoring to make Kubernetes work reliably long term, can we consider that a Kubernetes bug?
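As a rough sketch of the "terminate the instance on observed failures" pattern described above (not the actual Convox implementation; that lives at https://github.com/convox/rack), a per-instance watchdog might look something like this, assuming boto3, instance-profile credentials, and access to the EC2 instance metadata endpoint. The thresholds and check details are assumptions.

```python
"""Sketch of an instance watchdog that self-terminates on the failures
described above (full/read-only disk, unresponsive docker daemon).
Not the Convox implementation; thresholds and checks are assumptions."""
import os
import subprocess
import urllib.request

import boto3  # assumes credentials via the instance profile


def disk_ok(path="/", min_free_ratio=0.05):
    """Fail on a nearly full or read-only root filesystem."""
    st = os.statvfs(path)
    if st.f_blocks and st.f_bavail / st.f_blocks < min_free_ratio:
        return False
    return os.access(path, os.W_OK)


def docker_ok(timeout=30):
    """Fail if the docker daemon does not answer `docker info` in time."""
    try:
        subprocess.run(["docker", "info"], check=True, timeout=timeout,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return True
    except (subprocess.SubprocessError, OSError):
        return False


def instance_id():
    # IMDSv1; newer instances may require IMDSv2 session tokens instead.
    url = "http://169.254.169.254/latest/meta-data/instance-id"
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.read().decode()


def main():
    if disk_ok() and docker_ok():
        return
    # Terminating lets the Auto Scaling group replace the unhealthy instance.
    boto3.client("ec2").terminate_instances(InstanceIds=[instance_id()])


if __name__ == "__main__":
    main()
```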
jbaptiste about 9 years ago
We had the very same issue with kube-dns not that long ago. Have you considered running Prometheus against your cluster?

You can leverage Kubernetes services to get viable monitoring along with automatic discovery of new metrics/services.

We've been using it for some time on three different clusters (Kubernetes, AWS and bare metal) and are very pleased with the performance.
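To make the automatic-discovery point concrete: the usual pattern is for each service to expose a /metrics endpoint, which a Prometheus server configured with kubernetes_sd_configs then discovers and scrapes on its own. Below is a minimal sketch of the service side, assuming the prometheus_client Python library; the metric names and port are illustrative, not from the comment.

```python
"""Minimal sketch of a service exposing Prometheus metrics so that a
Prometheus server using Kubernetes service discovery can scrape it
automatically. Metric names and the port are illustrative assumptions."""
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("myapp_requests_total", "Total requests handled")
IN_FLIGHT = Gauge("myapp_in_flight_requests", "Requests currently in flight")


def handle_request():
    IN_FLIGHT.inc()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.inc()
    finally:
        IN_FLIGHT.dec()


if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        handle_request()
```

On the Prometheus side, a scrape job using kubernetes_sd_configs (commonly gated on a pod or service annotation via relabel_configs) is what turns newly deployed services into monitored targets without any config changes.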
hathym about 9 years ago
and who monitors the monitor of a monitor?