科技回声 (Tech Echo)

A technology news platform built with Next.js, providing global tech news and discussion.


Google Kubernetes Engine incident spanning 9 days

181 points | by talonx | over 1 year ago

13 comments

wg0, over 1 year ago

Maybe this is a hated take, but just wondering: the place that pretty much invented cluster orchestration and reinvented it as k8s is having problems upgrading it.

What chance does a bunch of poor sysadmins stand running a bunch of k8s clusters for a mid-size company, I wonder.

Every time I think of deploying it (self-managed; I have done the full stack) for something mission-critical, this upgrade scenario simply makes me rethink it altogether.

And even managed k8s has no guarantees, and if managed is to be the option, nothing beats ECS in simplicity and smooth operation at certain scales.

PS: Full-stack k8s means ingress controllers, DNS auto-registration, GitOps, logging, monitoring, CI/CD, and all the bells and whistles, including a management UI behind OAuth, etc.
(7 replies not loaded)
alectroem, over 1 year ago

Wow, I literally did a full cluster version upgrade last night without knowing about this. I would have delayed the upgrade if I had known GKE was failing for "a small number of customers".

I wish cloud providers would just communicate outages in services I use, like this one, to me!
(9 replies not loaded)
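For anyone who would rather poll for this kind of incident than wait for an announcement: Google Cloud publishes its incident history as JSON at https://status.cloud.google.com/incidents.json. A minimal sketch of a pre-upgrade check; the field names (`service_name`, `end`, `external_desc`) are assumptions about that feed's schema and should be verified against the live data:

```python
"""Sketch: look for active GKE incidents in Google Cloud's public status feed.

The feed URL is real; the field names used below are assumptions about its
schema -- check the live JSON before relying on this."""
import json
from urllib.request import urlopen

FEED_URL = "https://status.cloud.google.com/incidents.json"


def active_gke_incidents(incidents):
    """Return incidents mentioning Kubernetes that have no `end` timestamp yet."""
    return [
        i for i in incidents
        if "Kubernetes" in i.get("service_name", "") and not i.get("end")
    ]


def fetch_incidents(url=FEED_URL):
    """Download and parse the incident feed (network access required)."""
    with urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Offline demonstration on sample records shaped like the feed.
    sample = [
        {"service_name": "Google Kubernetes Engine",
         "begin": "2023-10-04T00:00:00Z", "end": None,
         "external_desc": "Node pool upgrades failing"},
        {"service_name": "Cloud Storage",
         "begin": "2023-09-01T00:00:00Z", "end": "2023-09-01T02:00:00Z",
         "external_desc": "Elevated latency"},
    ]
    for inc in active_gke_incidents(sample):
        print(inc["external_desc"])
```

Running `fetch_incidents()` and piping the result through `active_gke_incidents` before kicking off an upgrade would have flagged this outage.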
input_sh, over 1 year ago

They also had an issue with creating and deleting persistent volumes on 1.25. It lasted for 15 days, or half a month(!), last month: https://status.cloud.google.com/incidents/EBxyHQgEPnbM3Syag5yL

I'm also incredibly annoyed at them displaying times in PDT. I genuinely don't understand why they decided on that instead of doing something normal like UTC or detecting my timezone. It's especially annoying every six months, because Europe and the US don't make Daylight Saving Time changes at the same time, so for a week or two there's an additional hour I have to account for.
(1 reply not loaded)
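The DST headache is at least easy to sidestep programmatically: the standard-library zoneinfo database knows when each zone switches, so converting the status page's Pacific timestamps to UTC handles the offset on both sides automatically. A sketch, assuming the timestamps parse as naive "YYYY-MM-DD HH:MM" strings:

```python
"""Sketch: convert a status-page timestamp given in US Pacific time to UTC,
letting zoneinfo pick the correct PDT/PST offset for the date."""
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_to_utc(text, fmt="%Y-%m-%d %H:%M"):
    """Parse a naive Pacific-time string and return an aware UTC datetime."""
    naive = datetime.strptime(text, fmt)
    return naive.replace(tzinfo=PACIFIC).astimezone(ZoneInfo("UTC"))


if __name__ == "__main__":
    # 2023-10-06 10:00 is PDT (UTC-7), so this prints a 17:00 UTC timestamp;
    # the same clock time in December would be PST (UTC-8), i.e. 18:00 UTC.
    print(pacific_to_utc("2023-10-06 10:00"))
    print(pacific_to_utc("2023-12-06 10:00"))
```

A final `.astimezone()` with no argument would instead render the result in the reader's local zone.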
DishyDev, over 1 year ago

Not a great week for managed Kubernetes services, as DigitalOcean has had an ongoing issue on their service since yesterday morning: https://status.digitalocean.com/incidents/fsfsv9fj43w7
(2 replies not loaded)
endisneigh, over 1 year ago

It sucks that there isn't anything simpler than k8s that's production grade.

Maybe it's time to YOLO with a regular container that just restarts on failures, ha…
(10 replies not loaded)
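The "just restart it on failure" idea above is easy to sketch as a toy supervisor loop with capped exponential backoff. This is illustrative only (the command and limits are made up), not a substitute for a real init system or orchestrator:

```python
"""Toy supervisor: rerun a command until it exits cleanly, with capped
exponential backoff between restarts."""
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, backoff=0.1, backoff_cap=5.0):
    """Run cmd, restarting on nonzero exit. Returns the number of attempts."""
    attempts = 0
    delay = backoff
    while True:
        attempts += 1
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempts  # clean exit: stop supervising
        if attempts > max_restarts:
            raise RuntimeError(f"{cmd!r} still failing after {attempts} attempts")
        time.sleep(delay)  # back off before the next restart
        delay = min(delay * 2, backoff_cap)


if __name__ == "__main__":
    ok = [sys.executable, "-c", "print('up')"]
    print("attempts:", run_with_restarts(ok))
```

Docker's `--restart=always` policy or a systemd unit with `Restart=on-failure` gives the same behavior without writing the loop yourself.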
uniformlyrandom, over 1 year ago

The incident impact (a node pool upgrade issue) seems to match the speed of the mitigation rollout. One does not want the cure to be worse than the disease; roll-forwards should be slow unless the impact is high (and even then, it should be a rollback/freeze rather than a fast roll-forward).
(2 replies not loaded)
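The policy described above can be stated as a tiny decision rule. The threshold and step size below are invented for illustration and are not anything Google has published:

```python
"""Sketch of a rollout policy: roll a fix forward slowly while impact is
low; if impact crosses a threshold, stop and roll back instead."""

def next_action(impact, fraction_mitigated, high_impact=0.5, step=0.05):
    """Decide the next rollout move.

    impact: estimated fraction of customers affected (0..1).
    fraction_mitigated: fraction of the fleet already carrying the fix (0..1).
    Returns ("rollback", 0.0) for high-impact incidents, otherwise
    ("roll_forward", new_fraction) advancing by one small step.
    """
    if impact >= high_impact:
        # Cure must not be worse than the disease: freeze and roll back.
        return ("rollback", 0.0)
    return ("roll_forward", min(1.0, fraction_mitigated + step))
```

Under this rule a 9-day mitigation is the expected cost of a low-impact incident: twenty 5% steps, each waiting for verification, rather than one risky fleet-wide push.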
shizcakes, over 1 year ago

We've been stuck in this state for all 9 days. We've filed tickets, etc., but no resolution has come about yet. We just retried yesterday; still not able to update the node pool.
(2 replies not loaded)
dilippkumar, over 1 year ago

Noob here with some meta-questions about developer and operations complexity.

From an outsider's perspective, it looks like in a 2x2 matrix of developer simplicity/complexity and operational simplicity/complexity, the current patterns all seem heavily biased toward developer simplicity / operational complexity.

1. Is this assumption correct?

2. Does optimizing for another quadrant, developer complexity / operational simplicity, make sense?

My intuition is that complexity in code can be managed far better than complexity in operations. Developers have abstractions, reusable libraries, unit tests/integration tests, etc. There may also be weird efficiencies that arise from having developers deal with some of these problems right from the design stage.

It seems Kubernetes takes a problem and pushes it fully to operations.

Is there a solution that takes this problem and turns it into a developer problem?
(2 replies not loaded)
mrweasel, over 1 year ago

In Google terminology, is a "mitigation" the same as a solution? I read it as "Yeah, we still have no idea how to fix this correctly, but we have applied a temporary workaround".
(3 replies not loaded)
vinni2, over 1 year ago

I have been pulling my hair out trying to fix this all week.
edude03, over 1 year ago

Creating a new node pool works, which is super easy on GKE, so it's pretty much a non-issue. Definitely frustrating, but not as bad as it sounds at first brush.
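The workaround described above, creating a fresh node pool instead of upgrading the stuck one, can be sketched as a dry-run command builder. The cluster, zone, pool name, and machine type are placeholders, and the flags should be double-checked against `gcloud container node-pools create --help`:

```python
"""Sketch: build (but don't run) the gcloud invocation for a replacement
GKE node pool. All resource names here are hypothetical placeholders."""
import shlex


def replacement_pool_cmd(cluster, zone, pool_name,
                         machine_type="e2-standard-4", num_nodes=3):
    """Return the argv for creating a replacement GKE node pool."""
    return [
        "gcloud", "container", "node-pools", "create", pool_name,
        f"--cluster={cluster}",
        f"--zone={zone}",
        f"--machine-type={machine_type}",
        f"--num-nodes={num_nodes}",
    ]


if __name__ == "__main__":
    cmd = replacement_pool_cmd("prod-cluster", "us-central1-a", "pool-v2")
    print(shlex.join(cmd))  # inspect, then execute via subprocess.run(cmd)
```

In practice one would then cordon and drain the old pool's nodes with `kubectl` so workloads reschedule onto the new pool before the stuck pool is deleted.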
timo-e-aus-e, over 1 year ago

Oh man, you don't want to be the engineer on call for that.
akokanka, over 1 year ago

This shows the astronomical complexity of k8s systems; even the gods of k8s fail.
(1 reply not loaded)