> In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have an interrupt fire periodically that services a port used only for emergency instructions (with the same security measures as production, and visible only internally). Ensure that it can allocate against a preallocated block of memory, and allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's needed it's vital.
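A minimal sketch of that idea in Go, assuming an HTTP control surface; the port, endpoint path, and headroom size are made up for illustration:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Preallocate headroom up front so the emergency path isn't starved
// when the rest of the process is under memory pressure. (Size is illustrative.)
var emergencyHeadroom = make([]byte, 64<<20)

func main() {
	// Touch the reservation so the pages are actually committed.
	for i := range emergencyHeadroom {
		emergencyHeadroom[i] = 1
	}

	emergency := http.NewServeMux()
	emergency.HandleFunc("/emergency/rollback", func(w http.ResponseWriter, r *http.Request) {
		// A real system would require mTLS here and keep this port internal-only.
		fmt.Fprintln(w, "rollback accepted")
	})

	// Dedicated internal listener that ordinary traffic never uses.
	go func() {
		log.Fatal(http.ListenAndServe("127.0.0.1:9901", emergency))
	}()

	// ...normal serving continues on the main port...
	log.Fatal(http.ListenAndServe(":8080", http.NotFoundHandler()))
}
```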
> In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The DNS song seems appropriate:

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tune-of-let-it-be
Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the API server is down, because it uses cache.Indexer. The records would be stale, but unless all of their application pods restarted (which they could not, since the API server was down) or all of the CoreDNS pods restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.
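For context, a minimal client-go sketch of that behavior: once the informer cache is synced, lookups are answered from the local store, so losing the API server only makes the answers stale. This is an illustration of the mechanism, not CoreDNS's actual code path.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	svcLister := factory.Core().V1().Services().Lister()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// From here on, reads are served from the in-memory store. If the API
	// server becomes unreachable, the watch breaks but the last known
	// (stale) Service objects are still returned.
	for {
		svcs, _ := svcLister.Services(metav1.NamespaceDefault).List(labels.Everything())
		fmt.Printf("services known locally: %d\n", len(svcs))
		time.Sleep(10 * time.Second)
	}
}
```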
I caused an API server outage once with a monitoring tool, although in my case it was a monstrosity of a 20,000-line script. We quickly realized what we had done and turned it off. I have seen in very large clusters (1000+ nodes) that you need to be especially sensitive about monitoring API server resource usage, depending on precisely what you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.
Splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished for service discovery / routing.

Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies, the data plane still has access to the location information of other peers.
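A rough sketch of that shape, under made-up assumptions (a control-plane agent periodically writes /etc/discovery/endpoints.json, and the data plane treats it as an eventually consistent snapshot):

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"sync"
	"time"
)

// Hypothetical local snapshot format written by a control-plane agent.
// The path and schema are assumptions for illustration.
type Snapshot struct {
	Endpoints map[string][]string `json:"endpoints"` // service name -> addresses
}

type Resolver struct {
	mu   sync.RWMutex
	snap Snapshot
}

// Refresh re-reads the on-disk snapshot; on any error it keeps the last
// good copy, so losing the control plane only means stale data, not none.
func (r *Resolver) Refresh(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Printf("refresh failed, serving last snapshot: %v", err)
		return
	}
	var s Snapshot
	if err := json.Unmarshal(data, &s); err != nil {
		log.Printf("bad snapshot, serving last snapshot: %v", err)
		return
	}
	r.mu.Lock()
	r.snap = s
	r.mu.Unlock()
}

func (r *Resolver) Lookup(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.snap.Endpoints[service]
}

func main() {
	r := &Resolver{}
	for {
		r.Refresh("/etc/discovery/endpoints.json")
		log.Println("payments backends:", r.Lookup("payments"))
		time.Sleep(30 * time.Second)
	}
}
```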
Surprised they don't have a slower rollout across multiple regions / Kubernetes clusters, given that the K8s APIs are a single point of failure, as shown here where one change brought the control plane down.

Also, stale-if-error is a far safer pattern for service discovery than TTL'd DNS.
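For illustration, a stale-if-error wrapper around ordinary DNS lookups might look roughly like this; the TTL and structure are assumptions, not a reference to any particular library:

```go
package main

import (
	"context"
	"log"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

// StaleCache caches lookups with a TTL, but serves an expired entry
// when the live lookup fails (stale-if-error).
type StaleCache struct {
	mu    sync.Mutex
	items map[string]entry
	ttl   time.Duration
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{items: map[string]entry{}, ttl: ttl}
}

func (c *StaleCache) Lookup(ctx context.Context, host string) ([]string, error) {
	c.mu.Lock()
	cached, ok := c.items[host]
	c.mu.Unlock()

	if ok && time.Now().Before(cached.expires) {
		return cached.addrs, nil // fresh hit
	}

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		if ok {
			log.Printf("lookup of %s failed, serving stale entry: %v", host, err)
			return cached.addrs, nil // stale-if-error
		}
		return nil, err
	}

	c.mu.Lock()
	c.items[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	c := NewStaleCache(30 * time.Second)
	addrs, err := c.Lookup(context.Background(), "example.com")
	log.Println(addrs, err)
}
```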
For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

How can such a large change not be staged in some manner or another? Feedback loops have a way of catching up later, which is why it's important to roll out gradually.
Seems like automated node access could also have been helpful here: kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.
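A hedged sketch of what that could look like as a node-local action, assuming crictl is available on the node and the offending pods can be matched by the name substring "telemetry" (both assumptions). The kubelet may recreate the pods, so this only buys time for a rollback:

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

func main() {
	// List sandbox IDs for pods whose name contains "telemetry" (assumed name).
	out, err := exec.Command("crictl", "pods", "--name", "telemetry", "-q").Output()
	if err != nil {
		log.Fatalf("crictl pods: %v", err)
	}
	for _, id := range strings.Fields(string(out)) {
		// Stopping the sandbox kills the pod's containers locally,
		// without any call to the (overloaded) API server.
		if err := exec.Command("crictl", "stopp", id).Run(); err != nil {
			log.Printf("stopp %s: %v", id, err)
		}
	}
}
```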