I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.
We had an issue a few weeks ago where the Google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having a P1 ticket in, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!), but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...
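For anyone in a similar spot: a dumb probe script that just logs status codes over time was what finally got our ticket taken seriously. A minimal sketch of the idea (Python; the endpoint is a placeholder and a real probe would need an auth header, so treat this as illustrative rather than a drop-in):

    # probe.py -- periodically hit a service endpoint and log status codes /
    # TLS failures with timestamps, to hand to support as evidence.
    # ENDPOINT is a placeholder; point it at whatever API is misbehaving,
    # and add auth if the endpoint requires it.
    import csv, datetime, time
    import requests

    ENDPOINT = "https://pubsub.googleapis.com/v1/projects/YOUR_PROJECT/topics"  # placeholder
    INTERVAL_SECONDS = 30

    with open("probe_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            ts = datetime.datetime.utcnow().isoformat()
            try:
                resp = requests.get(ENDPOINT, timeout=10)
                writer.writerow([ts, resp.status_code, resp.elapsed.total_seconds()])
            except requests.exceptions.SSLError as exc:
                writer.writerow([ts, "TLS_ERROR", str(exc)])
            except requests.exceptions.RequestException as exc:
                writer.writerow([ts, "REQUEST_ERROR", str(exc)])
            f.flush()
            time.sleep(INTERVAL_SECONDS)

Even without auth you can tell the difference between the front end answering normally (401/403) and it falling over (502), which was the whole argument we needed to make.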
Hi - I work at Google on GKE - sorry about the problems you're experiencing. There are a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.
Question to Google employees:

Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.
Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or Azure? Even if, after doing a technical assessment, I thought that GCP was technically slightly better, if something happened, the first question I would be asked is “why did you choose GCP over AWS?”

No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for their enterprise support.

From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.
>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?
A generic question: Our company is completely dependent on AWS. Sure we have taken all of the standard precautions for redundancy, but what happened here could just as easily happen with AWS - a needed resource is down globally.

What would a small business do as a contingency plan?
I honestly don't mind if providers have outages - we can't expect 100.00% uptime, I know the systems I manage certainly don't achieve that.

One thing I *do* care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

(I'm not affected by the GKE outage so opinions may differ right now!)
Do not use GCP without paying for support.
We have had resource allocation errors for weeks, as have a lot of other people.
Check out the posts in their forum where folk on basic support get zero love.
<a href="https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/gce-discussion" rel="nofollow">https://groups.google.com/forum/?utm_medium=email&utm_source...</a>
Been trying to spin up VM instances all day; I had to try every single zone just to get one up. Not only is this incredibly harmful to a technology business dependent on this infra, it wasn't obvious to me what the issue was until I tried creating instances. Nothing says "hey, resources are constrained here, try this one." Just about ready to bite the bullet and move to AWS.
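In the meantime, the only workaround I've found is to brute-force zones until one accepts the request. A rough sketch of that, shelling out to the gcloud CLI (instance name, machine type, and zone list are placeholders for whatever you actually use):

    # try_zones.py -- attempt to create a VM in a list of zones until one succeeds.
    # Uses the gcloud CLI via subprocess; names and zones below are placeholders.
    import subprocess

    INSTANCE_NAME = "scratch-vm-1"      # placeholder
    MACHINE_TYPE = "n1-standard-1"
    ZONES = ["us-central1-a", "us-central1-b", "us-west2-c", "europe-west1-b"]

    def try_create(zone):
        # gcloud exits non-zero when the zone is out of resources.
        result = subprocess.run(
            ["gcloud", "compute", "instances", "create", INSTANCE_NAME,
             "--zone", zone, "--machine-type", MACHINE_TYPE],
            capture_output=True, text=True)
        return result.returncode == 0, result.stderr

    for zone in ZONES:
        ok, err = try_create(zone)
        if ok:
            print(f"created {INSTANCE_NAME} in {zone}")
            break
        print(f"{zone} failed: {err.strip().splitlines()[-1] if err else 'unknown error'}")
    else:
        print("no zone had capacity")

Not pretty, but it beats clicking through every zone in the console by hand.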
Seems to be some weird underlying issue going on at GCP at the moment. Had cloud build webhooks returning a 500 error. Noticed we were at 255 images and deleting some fixed the issue. Created a P2 ticket about the issue before we managed to solve it and haven't had a response in 40+ hours.

The timeline of this disruption matches when we started experiencing cloud build errors.
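In case it helps anyone hitting the same wall: pruning old images is scriptable. A rough sketch using the gcloud container images commands (the repository path and keep count are placeholders, and I'd verify the flags before pointing it at anything important):

    # prune_images.py -- delete all but the newest N digests of an image in GCR.
    # Repository path is a placeholder; check flags with `gcloud container images --help`.
    import json, subprocess

    IMAGE = "gcr.io/my-project/my-app"   # placeholder
    KEEP = 20                            # how many of the newest digests to keep

    listing = subprocess.run(
        ["gcloud", "container", "images", "list-tags", IMAGE,
         "--sort-by=~TIMESTAMP", "--format=json"],
        capture_output=True, text=True, check=True)
    entries = json.loads(listing.stdout)

    for entry in entries[KEEP:]:
        digest = f"{IMAGE}@{entry['digest']}"
        print(f"deleting {digest}")
        subprocess.run(
            ["gcloud", "container", "images", "delete", digest,
             "--quiet", "--force-delete-tags"],
            check=True)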
"third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11 saying things resolved on Nov 9.<p><a href="https://status.cloud.google.com/incident/container-engine/18005" rel="nofollow">https://status.cloud.google.com/incident/container-engine/18...</a>
Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least not for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down, in regions besides us-east-1, because they were using us-east-1 buckets. I have a feeling this is more common than you'd think: globally-redundant services which rely on some single point of geographical failure for some small part.
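If you want to see whether you have that kind of hidden us-east-1 dependency, a quick audit of where your buckets actually live is cheap to do. A minimal sketch with boto3 (assumes your AWS credentials are already configured; which buckets matter is up to your own service inventory):

    # bucket_regions.py -- list every S3 bucket the account owns and flag the
    # ones pinned to us-east-1, i.e. potential single-region dependencies.
    import boto3

    s3 = boto3.client("s3")

    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # get_bucket_location returns None for the legacy us-east-1 region.
        loc = s3.get_bucket_location(Bucket=name)["LocationConstraint"] or "us-east-1"
        flag = "  <-- single-region dependency?" if loc == "us-east-1" else ""
        print(f"{name}: {loc}{flag}")

It won't catch third-party services that are themselves pinned to one region, but it at least surfaces your own blind spots.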
There is no magic: public clouds have incredibly complex control planes, and marketing fluff aside, you would very likely experience much better uptime at a single top-tier DC than at a cloud provider.
This is not only GKE, but GCE as well.

I cannot create instances in almost all zones. I tried both preemptible and normal machines as well.

It always says resources are not available.

My account is a pretty new account.

In contrast, one of my friends has a pretty old account which is very active.

He has no such issue.

So I think that, due to this issue, Google has enabled some resource limitation for new accounts.

But they should properly communicate this.
Oh man, it must be a tough time to be an SRE at Google Cloud. But... they’re Google. They have been doing internal cloud for years and years. Borg — which K8s is a reimplementation of — has been the heart of Google for so long now you’d think they’d be able to architect their systems to have no outages whatsoever. I mean, nobody is perfect, but this looks bad.
As of this morning, I am *still* unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request".

An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

Is this typical of others' experiences?
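The only mitigation that's worked for me so far is rotating the zone whenever that "does not have enough resources" error comes back. A rough sketch of that wrapped around docker-machine (the google driver flag names here are from memory, so double-check them against `docker-machine create --driver google --help` before relying on this):

    # rotate_zone.py -- retry docker-machine creation across zones when GCE
    # reports a resource shortage. Names, project, and zones are placeholders;
    # verify the driver flags against your docker-machine version.
    import subprocess

    MACHINE_NAME = "runner-1"        # placeholder
    PROJECT = "my-gcp-project"       # placeholder
    ZONES = ["us-central1-a", "us-west2-c", "northamerica-northeast1-b"]

    for zone in ZONES:
        result = subprocess.run(
            ["docker-machine", "create", "--driver", "google",
             "--google-project", PROJECT, "--google-zone", zone, MACHINE_NAME],
            capture_output=True, text=True)
        output = result.stdout + result.stderr
        if result.returncode == 0:
            print(f"created {MACHINE_NAME} in {zone}")
            break
        if "does not have enough resources" in output:
            print(f"{zone} is out of resources, trying the next zone")
            continue
        # Some other failure -- surface it instead of silently retrying.
        print(output)
        break
    else:
        print("all zones exhausted")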
Right now we're experiencing an issue where a small percentage of end users on our GKE site are getting super slow speeds. The issue is ISP-related, as they can switch to a 4G hotspot in the same location and get normal speeds... and inside our system the timing looks normal. So there's a slowdown either TO the load balancer or FROM the load balancer. It took a week to convince Google's support contractor to even believe it wasn't an issue with our site, and their advice is generally along the lines of "turn it off and turn it back on again" (which might actually fix the problem), though that's easier said than done in GCP.
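What finally got us traction was showing the gap between what the backend measured and what the affected client measured. A small sketch of the client-side half (it assumes your app echoes its own processing time in an X-Backend-Time header, which is something you'd have to add yourself; the URL is a placeholder):

    # client_timing.py -- have an affected user run this and send back the CSV.
    # Compares wall-clock request time against the backend's self-reported time,
    # so the portion attributable to the network path becomes visible.
    # Assumes the app sets an "X-Backend-Time" header (seconds); adjust to taste.
    import csv, datetime, time
    import requests

    URL = "https://example.com/health"   # placeholder for the GKE-fronted endpoint
    SAMPLES = 20

    with open("timing_log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["utc_time", "total_seconds", "backend_seconds", "status"])
        for _ in range(SAMPLES):
            start = time.perf_counter()
            resp = requests.get(URL, timeout=30)
            total = time.perf_counter() - start
            backend = resp.headers.get("X-Backend-Time", "")
            writer.writerow([datetime.datetime.utcnow().isoformat(),
                             f"{total:.3f}", backend, resp.status_code])
            time.sleep(2)

If the client-side totals are large while the backend numbers stay flat, it's hard for support to keep pointing at your application.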
I use preemptible machines in autoscaling and for the first time did not have any machines available for multiple hours yesterday. I am wondering whether this falls under normal preemptible behaviour or whether this is the service degradation.
If anyone is interested, here is my documented experience with this issue. I freaking love GCP and GKE, although I now have no production environment, as it was an HA cluster in us-central1. Working on federation now.

https://stackoverflow.com/questions/53244471/gke-cluster-wont-provision-in-any-region
As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

And for those who have used both, which would you go with today?
I have a question. At what point does k8s make sense?

I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Often times, sharding on customers is rather trivial as well.

Monolith for the win! Opinions?