Maybe this is a hated take, but just wondering: the place that pretty much invented cluster orchestration and reinvented it as k8s is having problems upgrading it.

What chance does a bunch of poor sysadmins running k8s clusters for a mid-size company stand, I wonder?

Every time I think of deploying it (self-managed, have done the full stack) for something mission critical, this upgrade scenario makes me rethink it altogether.

And even managed k8s comes with no guarantees; if managed is to be the option, nothing beats ECS in simplicity and smooth operation at certain scales.

PS: Full-stack k8s means ingress controllers, DNS auto-registration, GitOps, logging, monitoring, CI/CD, and all the bells and whistles, including a management UI behind OAuth, etc.
Wow, I literally did a full cluster version upgrade last night without knowing about this. I would have delayed the upgrade if I had known GKE was failing for "a small number of customers".

I wish cloud providers would just communicate outages like this to me for the services I use!
They also had an issue with creating and deleting persistent volumes on 1.25 last month. It lasted 15 days, or half a month(!): https://status.cloud.google.com/incidents/EBxyHQgEPnbM3Syag5yL

I'm also incredibly annoyed at them displaying times in PDT. I genuinely don't understand why they decided on that instead of doing something normal like UTC or detecting my timezone. It's especially annoying every six months, because Europe and the US don't change for Daylight Saving Time at the same time, so for a week or two there's an additional hour I have to account for.
Not a great week for managed Kubernetes services, as Digital Ocean has been having an ongoing issue on their service since yesterday morning: https://status.digitalocean.com/incidents/fsfsv9fj43w7
Sucks that there isn’t anything simpler than k8s that’s production grade.

Maybe it’s time to yolo with a regular container that just restarts on failures, ha…
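Honestly, that gets you most of the way there for small setups. A minimal sketch with the docker-py SDK (the image name, port, and retry count are placeholders, not anything from the thread):

```python
# Minimal "yolo" deployment: one container with an automatic restart policy,
# so the Docker daemon restarts it whenever it exits non-zero.
# Assumes Docker is running locally and docker-py is installed (pip install docker).
import docker

client = docker.from_env()

container = client.containers.run(
    "myorg/myapp:latest",          # hypothetical image name
    detach=True,
    restart_policy={"Name": "on-failure", "MaximumRetryCount": 5},
    ports={"8080/tcp": 8080},      # publish the app port on the host
)

print(f"started {container.name}; Docker restarts it on failure")
```

You lose scheduling, rolling updates, and everything else k8s gives you, but there is also nothing to upgrade.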
The speed of the mitigation rollout seems to match the impact of the incident (a nodepool upgrade issue). One does not want the cure to be worse than the disease; roll-forwards should be slow unless the impact is high (and even then, it should be a rollback or freeze rather than a fast roll-forward).
We've been stuck in this state for all 9 days. We've filed tickets, etc., but no resolution has come about yet. Just retried yesterday; still not able to update the node pool.
Noob here with some meta-questions about developer and operations complexity.

From an outsider’s perspective, it looks like in a 2x2 matrix of developer simplicity/complexity and operational simplicity/complexity, the current patterns are all heavily biased toward developer simplicity / operational complexity.

1. Is this assumption correct?

2. Does optimizing for another quadrant, developer complexity / operational simplicity, make sense?

My intuition is that complexity in code can be managed far better than complexity in operations. Developers have abstractions, reusable libraries, unit tests/integration tests, etc. There may also be weird efficiencies that arise from having developers deal with some of these problems right from the design stage.

It seems Kubernetes takes a problem and pushes it fully to operations.

Is there a solution that takes this problem and turns it into a developer problem?
In Google terminology, is a "mitigation" the same as a solution? I read it as "Yeah, we still have no idea how to fix this correctly, but we have applied a temporary workaround".
Creating a new node pool works, which is super easy on GKE, so it's pretty much a non-issue. Definitely frustrating, but not as bad as it sounds at first blush.
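For anyone doing that workaround: after the replacement pool is up, you still have to shift workloads off the old one. A rough sketch with the Python kubernetes client (the pool name is a placeholder, and a real version should retry evictions blocked by PodDisruptionBudgets):

```python
# Sketch: cordon every node in the old GKE node pool, then evict its pods
# so they reschedule onto the new pool. Assumes `pip install kubernetes`,
# a working kubeconfig, and a recent client that exposes V1Eviction.
from kubernetes import client, config

OLD_POOL = "old-pool"  # placeholder node pool name

config.load_kube_config()
v1 = client.CoreV1Api()

# GKE labels every node with the name of its node pool.
nodes = v1.list_node(
    label_selector=f"cloud.google.com/gke-nodepool={OLD_POOL}"
).items

for node in nodes:
    name = node.metadata.name
    # Cordon: mark the node unschedulable so no new pods land on it.
    v1.patch_node(name, {"spec": {"unschedulable": True}})

    # Evict each pod on the node, skipping DaemonSet-managed pods.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={name}"
    ).items
    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        eviction = client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
            )
        )
        v1.create_namespaced_pod_eviction(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body=eviction,
        )
    print(f"drained {name}")
```

Once the old pool is empty you can delete it, which sidesteps the broken upgrade path entirely.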