Hey everyone - Seth from Google here. We’re currently investigating this incident and hope to have it resolved soon. Check the status dashboard for more information (<a href="https://status.cloud.google.com/incident/cloud-datastore/19006" rel="nofollow">https://status.cloud.google.com/incident/cloud-datastore/190...</a>), and I’ll try to answer questions on this thread when I’m off mobile this morning.
<a href="https://status.cloud.google.com/summary" rel="nofollow">https://status.cloud.google.com/summary</a><p>Google App Engine seems to be a very fragile service. From Sept. 2019 It's going down every month. 10 hour+ outage in July, Sept. and Oct.<p>For the premium they charge for App Engine, one would expect the service to be more reliable.
Our app is down. I can't even access any pages in Google Cloud Console. Timing out. Sometimes it loads and shows all our clusters gone, then a timeout error. This is brutal. This isn't just GKE either...<p>Edit: Things are working for us now<p>Edit: Still getting timeouts and service unavailable<p>Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the status page indicating there's any issue.<p>Edit: Seems our Cloud SQL instance was restarted as well<p>Edit: Multiple restarts of our production database<p>Edit: Dashboard finally updated to reflect the growing # of services affected<p>Edit: This wasn't an "App Engine" incident. It was a very wide-ranging incident. Just change the title to "Google Cloud Incident" and be done with it<p>Edit: Things seem to have stabilized for us<p>Was supposed to have today off with my family (Remembrance Day in Canada), but now I have to deal with support issues all day. Thanks Google!
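For anyone who wants to sanity-check from their own side, these are roughly the commands worth poking at (the bucket and instance names are placeholders, substitute your own):
$ gsutil ls gs://YOUR_BUCKET
$ gcloud sql operations list --instance=YOUR_INSTANCE --limit=5
The first is where the bucket 503s show up, and the second lists recent operations on a Cloud SQL instance, which is where the unexpected restarts appear.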
Ouch, when something as basic as this fails:<p>11:17 $ gcloud container clusters list
WARNING: The following zones did not respond: us-east1, us-east1-b, us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.<p>GCP Web Console is also really struggling - e.g. the homepage for our view of 'Cloud Functions' spins for a minute and then claims the API is not enabled (it sure is).<p>Ah there it is... <a href="https://status.cloud.google.com/incident/cloud-datastore/19006" rel="nofollow">https://status.cloud.google.com/incident/cloud-datastore/190...</a>
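Same story from the CLI if you want to rule out your own config (the zone is just the one we happened to query; the grep is a lazy way to confirm the API really is enabled):
$ gcloud container clusters list --zone us-east1-b
$ gcloud services list --enabled | grep cloudfunctions
If the zone listing times out while the grep still shows cloudfunctions.googleapis.com, the problem is the incident and not your project.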
One of our GKE clusters suddenly went missing today (from the console as well as kubectl) and we were scared for a while, panicking about how the cluster could have gotten deleted.<p>Google should have put up some kind of alert dialog in the console saying that some services are experiencing downtime.
I'm also seeing issues with GCE and GCS. Getting permissions errors and timeouts.<p>GKE cluster API endpoints have high error rates or timeouts too.<p>"Multiple services reporting issues" on <a href="https://status.cloud.google.com/" rel="nofollow">https://status.cloud.google.com/</a> now. Can we update the title?
This is pretty bad - on a regional cluster:<p>$ kubectl get nodes
The connection to the server XXX was refused - did you specify the right host or port?<p>Bigtable has also been unresponsive for some time now.<p>EDIT: This is us-east1. Responding again now.
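For anyone hitting the same refused connection, the usual first checks before blaming the incident look something like this (cluster name and region are placeholders for your own):
$ gcloud container clusters get-credentials my-cluster --region us-east1
$ kubectl cluster-info
$ kubectl get nodes -v=6
If refreshing credentials doesn't help and the verbose output shows the regional API endpoint itself refusing the TCP connection, it's on their side, not yours.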
Aside from GKE, chat.google.com and calendar.google.com are acting weird, while hangouts.google.com is working just fine here. What's also interesting is that the GCP dashboard shows this issue as being a few days old now.<p>EDIT: now the dashboard shows multiple services having issues, across the board.
Unfortunately, I feel like Google has one of these every 6 months, and I really hope they resolve it. I've been an App Engine user since 2008 and there are many mission-critical apps that are heavily impacted by any downtime. It usually ends up being a networking configuration issue on their end in the US East region? A strange repeating pattern.
I can't tell if it's a coincidence, but I've had all of our GCE pull-queues failing with "transient failures" for pretty much the entire time GKE has been reporting this issue.<p>Have they only _just_ realised this is affecting GCE after all this time, or has it only _just_ started to affect GCE?
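If it helps anyone, a blunt retry-with-backoff wrapper at least keeps things limping along. This is a generic sketch, not anything specific to the pull-queue API; the gcloud call at the end is just a stand-in for whatever lease call is failing:

  retry() {
    # run the given command up to 5 times, backing off 2s, 4s, 8s, 16s, 32s
    for attempt in 1 2 3 4 5; do
      "$@" && return 0
      sleep $(( 2 ** attempt ))
    done
    return 1
  }
  retry gcloud compute instances list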
[edit] Incident logged on GAE <a href="https://status.cloud.google.com/incident/appengine/19013" rel="nofollow">https://status.cloud.google.com/incident/appengine/19013</a><p>Our Google App Engine Flex app is not working. We are just getting 502 errors. Locally, everything is working fine.
The service is not working: the instance stays in a restarting state and shows the message "This VM is being restarted".<p>As per this status page, the issue was supposed to be resolved on 1st Nov:
<a href="https://status.cloud.google.com/incident/appengine/19012" rel="nofollow">https://status.cloud.google.com/incident/appengine/19012</a>
I'm really waiting for the postmortem. The first services down were networking/datastore, and some minutes later all the others started to fail. My hypothesis is that network failures prevented Paxos, a CP algorithm, from making progress, blocking writes.
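Back-of-envelope version of that hypothesis (the 5-replica group size is purely my assumption, not anything Google has published):

  replicas=5
  quorum=$(( replicas / 2 + 1 ))   # a Paxos write needs a majority of acks, here 3 of 5
  reachable=2                      # suppose the partition leaves the proposer able to see only 2 replicas, itself included
  if [ "$reachable" -ge "$quorum" ]; then echo "writes can commit"; else echo "writes blocked"; fi

If the partition leaves no side with a majority, writes stall everywhere even though each replica is individually healthy, which would explain the datastore going first and the services built on it following.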
It feels like GCP has not done a very good job of reducing the blast radius within its services. Each time there is an outage, so many downstream Google services are affected.<p>It's unbelievable that this is the second multi-region outage this year.
It’s days like this I miss having data centers to manage. At least then it was my fault when the service went down. Nowadays I have to create redundancy across two different cloud providers to keep my business running. Thanks Google!<p>For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.<p>Am I right when I hear my other sysadmin friends say GCP is like Gmail back in the day: still in beta?
I know this is no way related, but there was this other submission which I found excellent, "Taking too much slack out of the rubber band" [0], and it just made me wonder...<p>[0]:<a href="https://news.ycombinator.com/item?id=21502292" rel="nofollow">https://news.ycombinator.com/item?id=21502292</a>
Hmmm, this will be expensive.<p>I wonder what the 'lost revenue' costs will add up to. Also, I surely hope there aren't any medical/transportation/other critical things depending on this.