Google Kubernetes Engine incident spanning 9 days

181 points by talonx over 1 year ago

13 comments

wg0 over 1 year ago
Maybe this is a hated take, but just wondering: the place that pretty much invented cluster orchestration and reinvented it as k8s is having problems upgrading it.

What chance does a bunch of poor sysadmins running k8s clusters for a mid-size company stand, I wonder.

Every time I think of deploying it (self-managed; I have done the full stack) for something mission-critical, this upgrade scenario simply makes me rethink it altogether.

And even managed k8s has no guarantees; if managed is to be the option, nothing beats ECS in simplicity and smooth operation at certain scales.

PS: Full-stack k8s means ingress controllers, DNS auto-registration, GitOps, logging, monitoring, CI/CD, and all the bells and whistles, including a management UI behind OAuth, etc.
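As a rough illustration of the kind of pre-upgrade check a self-managed operator ends up scripting, here is a minimal sketch using the official `kubernetes` Python client to list each node's kubelet version and spot version skew before an upgrade. The cluster context is whatever your kubeconfig points at; nothing here comes from the incident itself.

```python
# Minimal pre-upgrade sanity check with the official `kubernetes`
# Python client (pip install kubernetes). Purely illustrative; it only
# reads node metadata and changes nothing.
from kubernetes import client, config


def list_kubelet_versions() -> None:
    """Print each node's kubelet version to spot version skew
    before starting a control-plane or node-pool upgrade."""
    config.load_kube_config()  # uses the current kubeconfig context
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        info = node.status.node_info
        print(f"{node.metadata.name}: kubelet {info.kubelet_version}")


if __name__ == "__main__":
    list_kubelet_versions()
```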
alectroem over 1 year ago
Wow, I literally did a full cluster version upgrade last night without knowing about this. I would have delayed the upgrade if I had known GKE was failing for "a small number of customers".

I wish cloud providers would just communicate outages like this to me for the services I use!
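Providers don't push such notifications, but Google does publish a machine-readable status feed you can poll yourself. A hedged sketch follows: the incidents.json endpoint exists, but the field names used below ("external_desc", "affected_products", "end") are assumptions about its schema and should be verified against the live feed.

```python
# Poll Google Cloud's public status feed for open incidents affecting
# products you care about. Stdlib only; field names are assumptions.
import json
import urllib.request

STATUS_FEED = "https://status.cloud.google.com/incidents.json"
WATCHED = {"Google Kubernetes Engine"}  # products you actually run on


def open_incidents():
    with urllib.request.urlopen(STATUS_FEED, timeout=10) as resp:
        incidents = json.load(resp)
    for inc in incidents:
        products = {p.get("title", "") for p in inc.get("affected_products", [])}
        if products & WATCHED and not inc.get("end"):  # no "end" => still open
            yield inc.get("external_desc", "(no description)")


if __name__ == "__main__":
    for desc in open_incidents():
        print("OPEN:", desc)
```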
input_sh over 1 year ago
They also had an issue with creating and deleting persistent volumes on 1.25. It lasted for 15 days, or half a month(!), last month: https://status.cloud.google.com/incidents/EBxyHQgEPnbM3Syag5yL

I'm also incredibly annoyed at them displaying time in PDT. I genuinely don't understand why they decided on that instead of doing something normal like UTC or detecting my timezone. It's especially annoying every six months, because Europe and the US don't make Daylight Saving Time changes at the same time, so for a week or two there's an additional hour I have to account for.
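The timezone gripe is at least easy to script around. A small stdlib-only sketch (Python 3.9+): convert a Pacific-time timestamp copied from a status page into UTC or any other zone, letting zoneinfo absorb the mismatched DST switchover weeks. The timestamp format and example values are invented for illustration.

```python
# Convert a 'YYYY-MM-DD HH:MM' Pacific timestamp into another zone.
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def pacific_to(zone_name: str, stamp: str) -> datetime:
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=PACIFIC).astimezone(ZoneInfo(zone_name))


if __name__ == "__main__":
    # Hypothetical incident start, shown in Pacific time on the page:
    print(pacific_to("UTC", "2023-10-06 09:30"))
    print(pacific_to("Europe/Amsterdam", "2023-10-06 09:30"))
```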
DishyDev over 1 year ago
Not a great week for managed Kubernetes services, as DigitalOcean has had an ongoing issue since yesterday morning on their service: https://status.digitalocean.com/incidents/fsfsv9fj43w7
endisneigh over 1 year ago
It sucks that there isn't anything simpler than k8s that's production-grade.

Maybe it's time to YOLO with a regular container that just restarts on failures, ha…
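Taken half-seriously, the YOLO option is just a restart policy. Below is a runnable sketch of a bare-bones supervisor that reruns a container command on failure with capped exponential backoff; the image name and backoff numbers are placeholders. Docker's built-in `--restart=on-failure` policy gives the same behavior without the wrapper.

```python
# Bare-bones supervisor: rerun a command until it exits cleanly,
# backing off exponentially on failures. Placeholder command below.
import subprocess
import time

CMD = ["docker", "run", "--rm", "myapp:latest"]  # hypothetical image


def supervise(max_backoff: float = 60.0) -> None:
    backoff = 1.0
    while True:
        started = time.monotonic()
        code = subprocess.run(CMD).returncode
        if code == 0:
            break  # clean exit: stop supervising
        if time.monotonic() - started > 300:
            backoff = 1.0  # it ran a while; treat the crash as fresh
        print(f"exited with {code}; restarting in {backoff:.0f}s")
        time.sleep(backoff)
        backoff = min(backoff * 2, max_backoff)


if __name__ == "__main__":
    supervise()
```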
uniformlyrandom over 1 year ago
The incident impact (a node-pool upgrade issue) seems to match the speed of the mitigation rollout. One does not want the cure to be worse than the disease; roll-forwards should be slow unless the impact is high (and even then, it should be a rollback/freeze rather than a fast roll-forward).
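That pacing argument can be made concrete. Here is a hypothetical sketch of impact-gated rollout logic: advance a mitigation in small stages, and on a health regression freeze and roll back rather than roll forward faster. The stages, error budget, and check_health() stub are all invented for illustration.

```python
# Staged rollout that prefers rollback over a faster roll-forward.
# check_health() is a stand-in for real monitoring.
import random
import time


def check_health() -> float:
    """Placeholder: return the current error rate (0.0-1.0)."""
    return random.uniform(0.0, 0.02)


def staged_rollout(stages=(1, 5, 25, 50, 100), error_budget=0.01) -> int:
    for percent in stages:
        print(f"rollout at {percent}%")
        time.sleep(1)  # stand-in for a real soak period
        if check_health() > error_budget:
            print("health regressed: freeze and roll back")
            return 0  # rollback, not a faster roll-forward
    return 100


if __name__ == "__main__":
    staged_rollout()
```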
shizcakes over 1 year ago
We've been stuck in this state for all 9 days. We've filed tickets, etc., but no resolution has come about yet. We retried just yesterday and still can't update the node pool.
dilippkumar over 1 year ago
Noob here with some meta-questions about developer and operations complexity.

From an outsider's perspective, it looks like in a 2x2 matrix of developer simplicity/complexity and operational simplicity/complexity, the current patterns all seem heavily biased toward developer simplicity / operational complexity.

1. Is this assumption correct?

2. Does optimizing for another quadrant, developer complexity / operational simplicity, make sense?

My intuition is that complexity in code can be managed far better than complexity in operations. Developers have abstractions, reusable libraries, unit tests/integration tests, etc. There may also be weird efficiencies that arise from having developers deal with some of these problems right from the design stage.

It seems Kubernetes takes a problem and pushes it fully to operations.

Is there a solution that takes this problem and turns it into a developer problem?
mrweasel over 1 year ago
In Google terminology, is a "mitigation" the same as a solution? I read it as "Yeah, we still have no idea how to fix this correctly, but we have applied a temporary workaround."
vinni2 over 1 year ago
I have been pulling my hair out trying to fix this all week.
edude03 over 1 year ago
Creating a new node group works, which is super easy on GKE, so it's pretty much a non-issue. Definitely frustrating, but not as bad as it sounds at first brush.
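For anyone stuck the same way, a sketch of that workaround follows: create a replacement pool, then cordon and drain the stuck pool's nodes so workloads reschedule. The cluster, zone, and pool names are placeholders; the gcloud and kubectl subcommands and flags shown are standard ones, but verify them against your versions.

```python
# Replace a stuck node pool: create a new one, then cordon and drain
# the old pool's nodes. All resource names below are placeholders.
import subprocess

CLUSTER, ZONE = "my-cluster", "us-central1-a"  # hypothetical


def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)


def replace_node_pool(old_pool: str, new_pool: str) -> None:
    run("gcloud", "container", "node-pools", "create", new_pool,
        f"--cluster={CLUSTER}", f"--zone={ZONE}", "--num-nodes=3")
    # GKE labels every node with the name of its pool.
    nodes = subprocess.run(
        ["kubectl", "get", "nodes",
         "-l", f"cloud.google.com/gke-nodepool={old_pool}",
         "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for node in nodes:
        run("kubectl", "cordon", node)
        run("kubectl", "drain", node,
            "--ignore-daemonsets", "--delete-emptydir-data")


if __name__ == "__main__":
    replace_node_pool("stuck-pool", "replacement-pool")
```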
timo-e-aus-e over 1 year ago
Oh man, you don't want to be the engineer on call for that.
akokanka over 1 year ago
This shows the astronomical complexity of k8s systems: even the gods of k8s fail.