> In order to make that fix, we needed to access the Kubernetes control plane – which we could not do due to the increased load to the Kubernetes API servers.

Something that I wish all databases and API servers would do, and that few actually do in practice, is to allocate a certain amount of headroom (memory and CPU) to "break glass in case of emergency" sessions. Have an interrupt fire periodically that services a port used only for emergency instructions (with the same security measures as production, and visible only internally). Ensure that it can allocate against a preallocated block of memory, and allow it to schedule higher-priority threads. A small concession to make in the usual course of business, but when it's needed it's vital.
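A minimal sketch of that idea in Go, assuming an HTTP control surface; the port, endpoint path, and headroom size are made up for illustration:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Preallocate headroom up front so the emergency path isn't starved
// when the rest of the process is under memory pressure. (Size is illustrative.)
var emergencyHeadroom = make([]byte, 64<<20)

func main() {
	// Touch the reservation so the pages are actually committed.
	for i := range emergencyHeadroom {
		emergencyHeadroom[i] = 1
	}

	emergency := http.NewServeMux()
	emergency.HandleFunc("/emergency/rollback", func(w http.ResponseWriter, r *http.Request) {
		// A real system would require mTLS here and keep this port internal-only.
		fmt.Fprintln(w, "rollback accepted")
	})

	// Dedicated internal listener that ordinary traffic never uses.
	go func() {
		log.Fatal(http.ListenAndServe("127.0.0.1:9901", emergency))
	}()

	// ...normal serving continues on the main port...
	log.Fatal(http.ListenAndServe(":8080", http.NotFoundHandler()))
}
```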
> In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
The DNS song seems appropriate:

https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tune-of-let-it-be
Something doesn't add up - CoreDNS's kubernetes plugin should be serving Service RRs from its internal cache even if the API server is down, because it uses cache.Indexer. The records would be stale, but unless all of their application pods restarted (which they could not, since the API server was down) or all of the CoreDNS pods restarted (which, again, they could not), records merely expiring from the cache shouldn't have caused a full discovery outage.
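For context, a minimal client-go sketch of that behavior: once the informer cache is synced, lookups are answered from the local store, so losing the API server only makes the answers stale. This is an illustration of the mechanism, not CoreDNS's actual code path.

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	svcLister := factory.Core().V1().Services().Lister()

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// From here on, reads are served from the in-memory store. If the API
	// server becomes unreachable, the watch breaks but the last known
	// (stale) Service objects are still returned.
	for {
		svcs, _ := svcLister.Services(metav1.NamespaceDefault).List(labels.Everything())
		fmt.Printf("services known locally: %d\n", len(svcs))
		time.Sleep(10 * time.Second)
	}
}
```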
I caused an API server outage once with a monitoring tool, although in my case it was a monstrosity of a 20,000-line script. We quickly realized what we had done and turned it off. I have seen in very large clusters (1000+ nodes) that you need to be especially sensitive about monitoring API server resource usage, depending on precisely what you are doing. Surprised they hadn't learned this lesson yet, given the likely scale of their workloads.
Splitting the control and data plane is a great way to improve resilience and prevent everything from being hard down. I wonder how it could be accomplished for service discovery / routing.

Maybe instead of relying on Kubernetes DNS for discovery it could be closer to something like Envoy: the control plane updates configs that are stored locally (and are eventually consistent), so even if the control plane dies, the data plane still has access to the location information of other peers.
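A rough sketch of that shape, under made-up assumptions (a control-plane agent periodically writes /etc/discovery/endpoints.json, and the data plane treats it as an eventually consistent snapshot):

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"sync"
	"time"
)

// Hypothetical local snapshot format written by a control-plane agent.
// The path and schema are assumptions for illustration.
type Snapshot struct {
	Endpoints map[string][]string `json:"endpoints"` // service name -> addresses
}

type Resolver struct {
	mu   sync.RWMutex
	snap Snapshot
}

// Refresh re-reads the on-disk snapshot; on any error it keeps the last
// good copy, so losing the control plane only means stale data, not none.
func (r *Resolver) Refresh(path string) {
	data, err := os.ReadFile(path)
	if err != nil {
		log.Printf("refresh failed, serving last snapshot: %v", err)
		return
	}
	var s Snapshot
	if err := json.Unmarshal(data, &s); err != nil {
		log.Printf("bad snapshot, serving last snapshot: %v", err)
		return
	}
	r.mu.Lock()
	r.snap = s
	r.mu.Unlock()
}

func (r *Resolver) Lookup(service string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.snap.Endpoints[service]
}

func main() {
	r := &Resolver{}
	for {
		r.Refresh("/etc/discovery/endpoints.json")
		log.Println("payments backends:", r.Lookup("payments"))
		time.Sleep(30 * time.Second)
	}
}
```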
Surprised they don't have a slower rollout across multiple regions / Kubernetes clusters, given that the K8s APIs are a single point of failure, as shown here where one change brought the control plane down.

Also, stale-if-error is a far safer pattern for service discovery than TTL'd DNS.
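For illustration, a stale-if-error wrapper around ordinary DNS lookups might look roughly like this; the TTL and structure are assumptions, not a reference to any particular library:

```go
package main

import (
	"context"
	"log"
	"net"
	"sync"
	"time"
)

type entry struct {
	addrs   []string
	expires time.Time
}

// StaleCache caches lookups with a TTL, but serves an expired entry
// when the live lookup fails (stale-if-error).
type StaleCache struct {
	mu    sync.Mutex
	items map[string]entry
	ttl   time.Duration
}

func NewStaleCache(ttl time.Duration) *StaleCache {
	return &StaleCache{items: map[string]entry{}, ttl: ttl}
}

func (c *StaleCache) Lookup(ctx context.Context, host string) ([]string, error) {
	c.mu.Lock()
	cached, ok := c.items[host]
	c.mu.Unlock()

	if ok && time.Now().Before(cached.expires) {
		return cached.addrs, nil // fresh hit
	}

	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		if ok {
			log.Printf("lookup of %s failed, serving stale entry: %v", host, err)
			return cached.addrs, nil // stale-if-error
		}
		return nil, err
	}

	c.mu.Lock()
	c.items[host] = entry{addrs: addrs, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return addrs, nil
}

func main() {
	c := NewStaleCache(30 * time.Second)
	addrs, err := c.Lookup(context.Background(), "example.com")
	log.Println(addrs, err)
}
```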
For me, this was stunning: “2:51pm to 3:20pm: The change was applied to all clusters”

How can such a large change not be staged in some manner or another? Feedback loops have a way of catching up later, which is why it's important to roll out gradually.
Seems like automated node access could also have been helpful here: kill the offending pods directly on the nodes to relieve API server pressure long enough to roll back.
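A hedged sketch of what that could look like as a node-local action, assuming crictl is available on the node and the offending pods can be matched by the name substring "telemetry" (both assumptions). The kubelet may recreate the pods, so this only buys time for a rollback:

```go
package main

import (
	"log"
	"os/exec"
	"strings"
)

func main() {
	// List sandbox IDs for pods whose name contains "telemetry" (assumed name).
	out, err := exec.Command("crictl", "pods", "--name", "telemetry", "-q").Output()
	if err != nil {
		log.Fatalf("crictl pods: %v", err)
	}
	for _, id := range strings.Fields(string(out)) {
		// Stopping the sandbox kills the pod's containers locally,
		// without any call to the (overloaded) API server.
		if err := exec.Command("crictl", "stopp", id).Run(); err != nil {
			log.Printf("stopp %s: %v", id, err)
		}
	}
}
```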