Nice. Thanks for sharing lessons learned and tools that encode this knowledge.

I work on container cluster management full time myself. I focus on AWS ECS, so the problems are technically very different but conceptually very similar.

The question is: who watches the watchmen?

A container scheduler is supposed to be responsible for maintaining the health of the entire cluster. But if it runs into fundamental trouble doing so, how do you automatically detect that and get it back into a working state?

On ECS I run an agent container on every instance that terminates the instance on observed failures (rough sketch below). The most common problems I have observed are a bad disk (full, read-only, or too slow) and a locked-up Docker daemon.

I also schedule one more monitor process in the cluster that periodically polls the ECS, EC2, and ASG APIs. A common failure mode is instances that lose ECS agent connectivity and need to be terminated (second sketch below).

All this hard-won knowledge is encoded in the open-source Convox platform: https://github.com/convox/rack

The next problem is that sometimes this monitor container stops working due to the very problems it's trying to correct! I plan to move it to a Lambda function to remove the correlated failure.

But I always wonder: why aren't these problems handled natively by Amazon and ECS?

The same question applies to this post. If you have to run additional monitoring to make Kubernetes work reliably over the long term, can we consider that a Kubernetes bug?
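To make the agent idea concrete, here's a minimal sketch in Go of the kind of node-level check I mean, assuming termination happens via the EC2 API. The probe path, timeouts, and check interval are illustrative, not Convox's actual implementation:

```go
// Minimal sketch of a node-level health agent: probe the disk and the
// Docker daemon, and terminate this instance if either looks wedged.
// Paths, thresholds, and the terminate-on-failure policy are assumptions.
package main

import (
	"context"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"os/exec"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// diskHealthy writes and removes a small probe file: a full or read-only
// filesystem fails the write, and a slow disk blows the deadline.
func diskHealthy(path string, limit time.Duration) bool {
	done := make(chan error, 1)
	go func() {
		err := ioutil.WriteFile(path, []byte("probe"), 0644)
		os.Remove(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err == nil
	case <-time.After(limit):
		return false // disk too slow
	}
}

// dockerHealthy checks that the Docker daemon answers a trivial command
// before the deadline; a locked-up daemon hangs and trips the timeout.
func dockerHealthy(limit time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	return exec.CommandContext(ctx, "docker", "ps", "-q").Run() == nil
}

// terminateSelf looks up this instance's ID from EC2 instance metadata
// and terminates it; the ASG then replaces it with a fresh instance.
func terminateSelf() error {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/instance-id")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	id, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	svc := ec2.New(session.Must(session.NewSession()))
	_, err = svc.TerminateInstances(&ec2.TerminateInstancesInput{
		InstanceIds: []*string{aws.String(string(id))},
	})
	return err
}

func main() {
	for range time.Tick(30 * time.Second) {
		if !diskHealthy("/var/tmp/health-probe", 5*time.Second) || !dockerHealthy(10*time.Second) {
			log.Println("node unhealthy, terminating instance")
			if err := terminateSelf(); err != nil {
				log.Printf("terminate failed: %v", err)
			}
		}
	}
}
```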
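And a similarly hedged sketch of the connectivity monitor, using the aws-sdk-go ECS and EC2 clients. The cluster name is a placeholder, pagination is omitted, and terminating immediately on a single disconnected reading is a simplification:

```go
// Minimal sketch of the cluster monitor: find ECS container instances
// whose agent has lost connectivity and terminate them so the ASG can
// bring up replacements that register cleanly.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	ecsSvc := ecs.New(sess)
	ec2Svc := ec2.New(sess)

	cluster := aws.String("my-cluster") // placeholder cluster name

	// List all container instances registered to the cluster.
	list, err := ecsSvc.ListContainerInstances(&ecs.ListContainerInstancesInput{
		Cluster: cluster,
	})
	if err != nil {
		log.Fatal(err)
	}
	if len(list.ContainerInstanceArns) == 0 {
		return
	}

	// Describe them to get the agent connectivity status.
	desc, err := ecsSvc.DescribeContainerInstances(&ecs.DescribeContainerInstancesInput{
		Cluster:            cluster,
		ContainerInstances: list.ContainerInstanceArns,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Terminate any instance whose ECS agent has disconnected.
	for _, ci := range desc.ContainerInstances {
		if ci.AgentConnected != nil && !*ci.AgentConnected {
			log.Printf("agent disconnected on %s, terminating", *ci.Ec2InstanceId)
			_, err := ec2Svc.TerminateInstances(&ec2.TerminateInstancesInput{
				InstanceIds: []*string{ci.Ec2InstanceId},
			})
			if err != nil {
				log.Printf("terminate failed: %v", err)
			}
		}
	}
}
```

In practice you'd also want a cooldown or a "seen disconnected N times" threshold so a transient blip doesn't cycle half the cluster at once.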