I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.
We had an issue a few weeks ago where the Google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having a P1 ticket in, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!), but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...
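For anyone in a similar spot: a dumb probe script that just logs status codes over time was what finally got our ticket taken seriously. A minimal sketch of the idea (Python; the endpoint is a placeholder and a real probe would need an auth header, so treat this as illustrative rather than a drop-in):

    # probe.py -- periodically hit a service endpoint and log status codes /
    # TLS failures with timestamps, to hand to support as evidence.
    # ENDPOINT is a placeholder; point it at whatever API is misbehaving,
    # and add auth if the endpoint requires it.
    import csv, datetime, time
    import requests

    ENDPOINT = "https://pubsub.googleapis.com/v1/projects/YOUR_PROJECT/topics"  # placeholder
    INTERVAL_SECONDS = 30

    with open("probe_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            ts = datetime.datetime.utcnow().isoformat()
            try:
                resp = requests.get(ENDPOINT, timeout=10)
                writer.writerow([ts, resp.status_code, resp.elapsed.total_seconds()])
            except requests.exceptions.SSLError as exc:
                writer.writerow([ts, "TLS_ERROR", str(exc)])
            except requests.exceptions.RequestException as exc:
                writer.writerow([ts, "REQUEST_ERROR", str(exc)])
            f.flush()
            time.sleep(INTERVAL_SECONDS)

Even without auth you can tell the difference between the front end answering normally (401/403) and it falling over (502), which was the whole argument we needed to make.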
Hi - I work at Google on GKE - sorry about the problems you're experiencing. There are a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.
Question to Google employees:

Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.
Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or Azure? Even if, after doing a technical assessment, I thought that GCP was technically slightly better, if something happened, the first question I would be asked is “why did you choose GCP over AWS?”

No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for their enterprise support.

From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.
>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?
A generic question: Our company is completely dependent on AWS. Sure we have taken all of the standard precautions for redundancy, but what happened here could just as easily happen with AWS - a needed resource is down globally.

What would a small business do as a contingency plan?
I honestly don't mind if providers have outages - we can't expect 100.00% uptime, I know the systems I manage certainly don't achieve that.

One thing I *do* care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

(I'm not affected by the GKE outage so opinions may differ right now!)
Do not use GCP without paying for support.
We have had resource allocation errors for weeks, as have a lot of other people.
Check out the posts in their forum where folk on basic support get zero love.
<a href="https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/gce-discussion" rel="nofollow">https://groups.google.com/forum/?utm_medium=email&utm_source...</a>
Been trying to spin up VM instances all day; I had to try every single zone just to get one up. Not only is this incredibly harmful to a technology business dependent on this infra, it wasn't obvious to me what the issue was until I tried creating instances. Nothing says "hey, resources are constrained here, try this one." Just about ready to bite the bullet and move to AWS.
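In the meantime, the only workaround I've found is to brute-force zones until one accepts the request. A rough sketch of that, shelling out to the gcloud CLI (instance name, machine type, and zone list are placeholders for whatever you actually use):

    # try_zones.py -- attempt to create a VM in a list of zones until one succeeds.
    # Uses the gcloud CLI via subprocess; names and zones below are placeholders.
    import subprocess

    INSTANCE_NAME = "scratch-vm-1"      # placeholder
    MACHINE_TYPE = "n1-standard-1"
    ZONES = ["us-central1-a", "us-central1-b", "us-west2-c", "europe-west1-b"]

    def try_create(zone):
        # gcloud exits non-zero when the zone is out of resources.
        result = subprocess.run(
            ["gcloud", "compute", "instances", "create", INSTANCE_NAME,
             "--zone", zone, "--machine-type", MACHINE_TYPE],
            capture_output=True, text=True)
        return result.returncode == 0, result.stderr

    for zone in ZONES:
        ok, err = try_create(zone)
        if ok:
            print(f"created {INSTANCE_NAME} in {zone}")
            break
        print(f"{zone} failed: {err.strip().splitlines()[-1] if err else 'unknown error'}")
    else:
        print("no zone had capacity")

Not pretty, but it beats clicking through every zone in the console by hand.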
Seems to be some weird underlying issue going on at GCP at the moment. Had cloud build webhooks returning a 500 error. Noticed we were at 255 images and deleting some fixed the issue. Created a P2 ticket about the issue before we managed to solve it and haven't had a response in 40+ hours.

The timeline of this disruption matches when we started experiencing cloud build errors.
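In case it helps anyone hitting the same wall: pruning old images is scriptable. A rough sketch using the gcloud container images commands (the repository path and keep count are placeholders, and I'd verify the flags before pointing it at anything important):

    # prune_images.py -- delete all but the newest N digests of an image in GCR.
    # Repository path is a placeholder; check flags with `gcloud container images --help`.
    import json, subprocess

    IMAGE = "gcr.io/my-project/my-app"   # placeholder
    KEEP = 20                            # how many of the newest digests to keep

    listing = subprocess.run(
        ["gcloud", "container", "images", "list-tags", IMAGE,
         "--sort-by=~TIMESTAMP", "--format=json"],
        capture_output=True, text=True, check=True)
    entries = json.loads(listing.stdout)

    for entry in entries[KEEP:]:
        digest = f"{IMAGE}@{entry['digest']}"
        print(f"deleting {digest}")
        subprocess.run(
            ["gcloud", "container", "images", "delete", digest,
             "--quiet", "--force-delete-tags"],
            check=True)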
"third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11 saying things resolved on Nov 9.<p><a href="https://status.cloud.google.com/incident/container-engine/18005" rel="nofollow">https://status.cloud.google.com/incident/container-engine/18...</a>
Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least not for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down, in regions besides us-east-1, because they were using us-east-1 buckets. I have a feeling this is more common than you'd think: globally-redundant services which rely on some single point of geographical failure for some small part.
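If you want to see whether you have that kind of hidden us-east-1 dependency, a quick audit of where your buckets actually live is cheap to do. A minimal sketch with boto3 (assumes your AWS credentials are already configured; which buckets matter is up to your own service inventory):

    # bucket_regions.py -- list every S3 bucket the account owns and flag the
    # ones pinned to us-east-1, i.e. potential single-region dependencies.
    import boto3

    s3 = boto3.client("s3")

    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # get_bucket_location returns None for the legacy us-east-1 region.
        loc = s3.get_bucket_location(Bucket=name)["LocationConstraint"] or "us-east-1"
        flag = "  <-- single-region dependency?" if loc == "us-east-1" else ""
        print(f"{name}: {loc}{flag}")

It won't catch third-party services that are themselves pinned to one region, but it at least surfaces your own blind spots.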
There is no magic: public clouds have incredibly complex control planes, and marketing fluff aside, you would very likely experience much better uptime at a single top-tier DC than at a cloud provider.
This is not only GKE, but GCE as well.

I cannot create instances in almost all zones. I tried both preemptible and normal machines as well.

It always says resources are not available.

My account is a pretty new account.

In contrast, one of my friends has a pretty old account which is very active.

He has no such issue.

So I think that, due to this issue, Google has enabled some resource limitation for new accounts.

But they should properly communicate this.
Oh man, it must be a tough time to be an SRE at Google Cloud. But... they’re Google. They have been doing internal cloud for years and years. Borg — which K8s is a reimplementation of — has been the heart of Google for so long now you’d think they’d be able to architect their systems to have no outages whatsoever. I mean, nobody is perfect, but this looks bad.
As of this morning, I am *still* unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request".

An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

Is this typical of others' experiences?
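The only mitigation that's worked for me so far is rotating the zone whenever that "does not have enough resources" error comes back. A rough sketch of that wrapped around docker-machine (the google driver flag names here are from memory, so double-check them against `docker-machine create --driver google --help` before relying on this):

    # rotate_zone.py -- retry docker-machine creation across zones when GCE
    # reports a resource shortage. Names, project, and zones are placeholders;
    # verify the driver flags against your docker-machine version.
    import subprocess

    MACHINE_NAME = "runner-1"        # placeholder
    PROJECT = "my-gcp-project"       # placeholder
    ZONES = ["us-central1-a", "us-west2-c", "northamerica-northeast1-b"]

    for zone in ZONES:
        result = subprocess.run(
            ["docker-machine", "create", "--driver", "google",
             "--google-project", PROJECT, "--google-zone", zone, MACHINE_NAME],
            capture_output=True, text=True)
        output = result.stdout + result.stderr
        if result.returncode == 0:
            print(f"created {MACHINE_NAME} in {zone}")
            break
        if "does not have enough resources" in output:
            print(f"{zone} is out of resources, trying the next zone")
            continue
        # Some other failure -- surface it instead of silently retrying.
        print(output)
        break
    else:
        print("all zones exhausted")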
Right now we're experiencing an issue where a small percentage of end users on our GKE site are getting super slow speeds. The issue is ISP-related, as they can switch to a 4G hotspot in the same location and get normal speeds... and inside our system the timing looks normal. So there's a slowdown either TO the load balancer or FROM the load balancer. It took a week to convince Google's support contractor to even believe it wasn't an issue with our site, and their advice is generally along the lines of "turn it off and turn it back on again" (which might actually fix the problem), though that's easier said than done in GCP.
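What finally got us traction was showing the gap between what the backend measured and what the affected client measured. A small sketch of the client-side half (it assumes your app echoes its own processing time in an X-Backend-Time header, which is something you'd have to add yourself; the URL is a placeholder):

    # client_timing.py -- have an affected user run this and send back the CSV.
    # Compares wall-clock request time against the backend's self-reported time,
    # so the portion attributable to the network path becomes visible.
    # Assumes the app sets an "X-Backend-Time" header (seconds); adjust to taste.
    import csv, datetime, time
    import requests

    URL = "https://example.com/health"   # placeholder for the GKE-fronted endpoint
    SAMPLES = 20

    with open("timing_log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_time", "total_seconds", "backend_seconds", "status"])
        for _ in range(SAMPLES):
            start = time.perf_counter()
            resp = requests.get(URL, timeout=30)
            total = time.perf_counter() - start
            backend = resp.headers.get("X-Backend-Time", "")
            writer.writerow([datetime.datetime.utcnow().isoformat(),
                             f"{total:.3f}", backend, resp.status_code])
            time.sleep(2)

If the client-side totals are large while the backend numbers stay flat, it's hard for support to keep pointing at your application.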
I use preemptible machines in autoscaling and for the first time did not have any machines available for multiple hours yesterday. I am wondering whether this falls under normal preemptible behaviour or whether this is the service degradation.
If anyone is interested, here is my documented experience with this issue. I freaking love GCP and GKE, although I now have no production environment, as it was an HA cluster in us-central1. Working on federation now.

https://stackoverflow.com/questions/53244471/gke-cluster-wont-provision-in-any-region
As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

And for those who have used both, which would you go with today?
I have a question. At what point does k8s make sense?

I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Often times, sharding on customers is rather trivial as well.

Monolith for the win! Opinions?