Google Kubernetes Engine incident spanning 9 days

181 points by talonx over 1 year ago

13 comments

wg0 over 1 year ago
Maybe this is a hated take, but just wondering: the place that pretty much invented cluster orchestration and reinvented it as k8s is having problems upgrading it.

What chance does a bunch of poor sysadmins running k8s clusters for a mid-size company stand, I wonder.

Every time I think of deploying it (self-managed; I have done the full stack) for something mission-critical, this upgrade scenario simply makes me rethink it altogether.

And even managed k8s has no guarantees; if managed is to be the option, nothing beats ECS in simplicity and smooth operation at certain scales.

PS: Full-stack k8s means ingress controllers, DNS auto-registration, GitOps, logging, monitoring, CI/CD, and all the bells and whistles, including a management UI behind OAuth, etc.
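As a rough illustration of the kind of pre-upgrade check a self-managed operator ends up scripting, here is a minimal sketch using the official `kubernetes` Python client to list each node's kubelet version and spot version skew before an upgrade. The cluster context is whatever your kubeconfig points at; nothing here comes from the incident itself.

```python
# Minimal pre-upgrade sanity check with the official `kubernetes`
# Python client (pip install kubernetes). Purely illustrative; it only
# reads node metadata and changes nothing.
from kubernetes import client, config


def list_kubelet_versions() -> None:
    """Print each node's kubelet version to spot version skew
    before starting a control-plane or node-pool upgrade."""
    config.load_kube_config()  # uses the current kubeconfig context
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        info = node.status.node_info
        print(f"{node.metadata.name}: kubelet {info.kubelet_version}")


if __name__ == "__main__":
    list_kubelet_versions()
```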
alectroem over 1 year ago
Wow, I literally did a full cluster version upgrade last night without knowing about this. I would have delayed the upgrade if I had known GKE was failing for "a small number of customers".

I wish cloud providers would just communicate outages like this to me for the services I use!
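Providers don't push such notifications, but Google does publish a machine-readable status feed you can poll yourself. A hedged sketch follows: the incidents.json endpoint exists, but the field names used below ("external_desc", "affected_products", "end") are assumptions about its schema and should be verified against the live feed.

```python
# Poll Google Cloud's public status feed for open incidents affecting
# products you care about. Stdlib only; field names are assumptions.
import json
import urllib.request

STATUS_FEED = "https://status.cloud.google.com/incidents.json"
WATCHED = {"Google Kubernetes Engine"}  # products you actually run on


def open_incidents():
    with urllib.request.urlopen(STATUS_FEED, timeout=10) as resp:
        incidents = json.load(resp)
    for inc in incidents:
        products = {p.get("title", "") for p in inc.get("affected_products", [])}
        if products & WATCHED and not inc.get("end"):  # no "end" => still open
            yield inc.get("external_desc", "(no description)")


if __name__ == "__main__":
    for desc in open_incidents():
        print("OPEN:", desc)
```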
input_sh over 1 year ago
They also had an issue with creating and deleting persistent volumes on 1.25. It lasted for 15 days, or half a month(!), last month: https://status.cloud.google.com/incidents/EBxyHQgEPnbM3Syag5yL

I'm also incredibly annoyed at them displaying time in PDT. I genuinely don't understand why they decided on that instead of doing something normal like UTC or detecting my timezone. It's especially annoying every six months, because Europe and the US don't make Daylight Saving Time changes at the same time, so for a week or two there's an additional hour I have to account for.
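The timezone gripe is at least easy to script around. A small stdlib-only sketch (Python 3.9+): convert a Pacific-time timestamp copied from a status page into UTC or any other zone, letting zoneinfo absorb the mismatched DST switchover weeks. The timestamp format and example values are invented for illustration.

```python
# Convert a 'YYYY-MM-DD HH:MM' Pacific timestamp into another zone.
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_to(zone_name: str, stamp: str) -> datetime:
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=PACIFIC).astimezone(ZoneInfo(zone_name))


if __name__ == "__main__":
    # Hypothetical incident start, shown in Pacific time on the page:
    print(pacific_to("UTC", "2023-10-06 09:30"))
    print(pacific_to("Europe/Amsterdam", "2023-10-06 09:30"))
```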
DishyDev over 1 year ago
Not a great week for managed Kubernetes services, as DigitalOcean has had an ongoing issue since yesterday morning on their service: https://status.digitalocean.com/incidents/fsfsv9fj43w7
endisneigh over 1 year ago
It sucks that there isn't anything simpler than k8s that's production-grade.

Maybe it's time to YOLO with a regular container that just restarts on failures, ha…
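Taken half-seriously, the YOLO option is just a restart policy. Below is a runnable sketch of a bare-bones supervisor that reruns a container command on failure with capped exponential backoff; the image name and backoff numbers are placeholders. Docker's built-in `--restart=on-failure` policy gives the same behavior without the wrapper.

```python
# Bare-bones supervisor: rerun a command until it exits cleanly,
# backing off exponentially on failures. Placeholder command below.
import subprocess
import time

CMD = ["docker", "run", "--rm", "myapp:latest"]  # hypothetical image


def supervise(max_backoff: float = 60.0) -> None:
    backoff = 1.0
    while True:
        started = time.monotonic()
        code = subprocess.run(CMD).returncode
        if code == 0:
            break  # clean exit: stop supervising
        if time.monotonic() - started > 300:
            backoff = 1.0  # it ran a while; treat the crash as fresh
        print(f"exited with {code}; restarting in {backoff:.0f}s")
        time.sleep(backoff)
        backoff = min(backoff * 2, max_backoff)


if __name__ == "__main__":
    supervise()
```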
uniformlyrandom over 1 year ago
The incident impact (a node-pool upgrade issue) seems to match the speed of the mitigation rollout. One does not want the cure to be worse than the disease; roll-forwards should be slow unless the impact is high (and even then, it should be a rollback/freeze rather than a fast roll-forward).
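That pacing argument can be made concrete. Here is a hypothetical sketch of impact-gated rollout logic: advance a mitigation in small stages, and on a health regression freeze and roll back rather than roll forward faster. The stages, error budget, and check_health() stub are all invented for illustration.

```python
# Staged rollout that prefers rollback over a faster roll-forward.
# check_health() is a stand-in for real monitoring.
import random
import time


def check_health() -> float:
    """Placeholder: return the current error rate (0.0-1.0)."""
    return random.uniform(0.0, 0.02)


def staged_rollout(stages=(1, 5, 25, 50, 100), error_budget=0.01) -> int:
    for percent in stages:
        print(f"rollout at {percent}%")
        time.sleep(1)  # stand-in for a real soak period
        if check_health() > error_budget:
            print("health regressed: freeze and roll back")
            return 0  # rollback, not a faster roll-forward
    return 100


if __name__ == "__main__":
    staged_rollout()
```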
shizcakes over 1 year ago
We've been stuck in this state for all 9 days. We've filed tickets, etc., but no resolution has come about yet. We retried just yesterday and still can't update the node pool.
dilippkumar over 1 year ago
Noob here with some meta-questions about developer and operations complexity.

From an outsider's perspective, it looks like in a 2x2 matrix of developer simplicity/complexity and operational simplicity/complexity, the current patterns all seem heavily biased toward developer simplicity / operational complexity.

1. Is this assumption correct?

2. Does optimizing for another quadrant, developer complexity / operational simplicity, make sense?

My intuition is that complexity in code can be managed far better than complexity in operations. Developers have abstractions, reusable libraries, unit tests/integration tests, etc. There may also be weird efficiencies that arise from having developers deal with some of these problems right from the design stage.

It seems Kubernetes takes a problem and pushes it fully to operations.

Is there a solution that takes this problem and turns it into a developer problem?
mrweasel over 1 year ago
In Google terminology, is a "mitigation" the same as a solution? I read it as "Yeah, we still have no idea how to fix this correctly, but we have applied a temporary workaround."
vinni2 over 1 year ago
I have been pulling my hair out trying to fix this all week.
edude03 over 1 year ago
Creating a new node group works, which is super easy on GKE, so it's pretty much a non-issue. Definitely frustrating, but not as bad as it sounds at first brush.
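For anyone stuck the same way, a sketch of that workaround follows: create a replacement pool, then cordon and drain the stuck pool's nodes so workloads reschedule. The cluster, zone, and pool names are placeholders; the gcloud and kubectl subcommands and flags shown are standard ones, but verify them against your versions.

```python
# Replace a stuck node pool: create a new one, then cordon and drain
# the old pool's nodes. All resource names below are placeholders.
import subprocess

CLUSTER, ZONE = "my-cluster", "us-central1-a"  # hypothetical


def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)


def replace_node_pool(old_pool: str, new_pool: str) -> None:
    run("gcloud", "container", "node-pools", "create", new_pool,
        f"--cluster={CLUSTER}", f"--zone={ZONE}", "--num-nodes=3")
    # GKE labels every node with the name of its pool.
    nodes = subprocess.run(
        ["kubectl", "get", "nodes",
         "-l", f"cloud.google.com/gke-nodepool={old_pool}",
         "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for node in nodes:
        run("kubectl", "cordon", node)
        run("kubectl", "drain", node,
            "--ignore-daemonsets", "--delete-emptydir-data")


if __name__ == "__main__":
    replace_node_pool("stuck-pool", "replacement-pool")
```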
timo-e-aus-e over 1 year ago
Oh man, you don't want to be the engineer on call for that.
akokanka over 1 year ago
This shows the astronomical complexity of k8s systems: even the gods of k8s fail.