Hey everyone - Seth from Google here. We’re currently investigating this incident and hope to have it resolved soon. Check the status dashboard for more information (<a href="https://status.cloud.google.com/incident/cloud-datastore/19006" rel="nofollow">https://status.cloud.google.com/incident/cloud-datastore/190...</a>), and I’ll try to answer questions on this thread when I’m off mobile this morning.
<a href="https://status.cloud.google.com/summary" rel="nofollow">https://status.cloud.google.com/summary</a><p>Google App Engine seems to be a very fragile service. From Sept. 2019 It's going down every month. 10 hour+ outage in July, Sept. and Oct.<p>For the premium they charge for App Engine, one would expect the service to be more reliable.
Our app is down. I can't even access any pages in Google Cloud Console. Timing out. Sometimes it loads and shows all our clusters gone, then a timeout error. This is brutal. This isn't just GKE either...<p>Edit: Things are working for us now<p>Edit: Still getting timeouts and service unavailable<p>Edit: I'm getting 503 (service unavailable) from buckets, but nothing on the status page indicating there's any issue.<p>Edit: Seems our Cloud SQL instance was restarted as well<p>Edit: Multiple restarts of our production database<p>Edit: Dashboard finally updated to reflect the growing # of services affected<p>Edit: This wasn't an "App Engine" incident. It was a very wide-ranging incident. Just change the title to "Google Cloud Incident" and be done with it<p>Edit: Things seem to have stabilized for us<p>Was supposed to have today off with my family (Remembrance Day in Canada), but now I have to deal with support issues all day. Thanks Google!
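For anyone who wants to sanity-check from their own side, these are roughly the commands worth poking at (the bucket and instance names are placeholders, substitute your own):
$ gsutil ls gs://YOUR_BUCKET
$ gcloud sql operations list --instance=YOUR_INSTANCE --limit=5
The first is where the bucket 503s show up, and the second lists recent operations on a Cloud SQL instance, which is where the unexpected restarts appear.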
Ouch, when something as basic as this fails:<p>11:17 $ gcloud container clusters list
WARNING: The following zones did not respond: us-east1, us-east1-b, us-east1-d, us-east1-c, us-east1-a. List results may be incomplete.<p>GCP Web Console is also really struggling - e.g. the homepage for our view of 'Cloud Functions' spins for a minute and then claims the API is not enabled (it sure is).<p>Ah there it is... <a href="https://status.cloud.google.com/incident/cloud-datastore/19006" rel="nofollow">https://status.cloud.google.com/incident/cloud-datastore/190...</a>
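Same story from the CLI if you want to rule out your own config (the zone is just the one we happened to query; the grep is a lazy way to confirm the API really is enabled):
$ gcloud container clusters list --zone us-east1-b
$ gcloud services list --enabled | grep cloudfunctions
If the zone listing times out while the grep still shows cloudfunctions.googleapis.com, the problem is the incident and not your project.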
One of our GKE clusters suddenly went missing today (from the console as well as kubectl) and we were scared for a while, panicking about how the cluster could have gotten deleted.<p>Google should have put up some kind of alert dialog in the console saying that some services are experiencing downtime.
I'm also seeing issues with GCE and GCS. Getting permissions errors and timeouts.<p>GKE cluster API endpoints have high error rates or timeouts too.<p>"Multiple services reporting issues" on <a href="https://status.cloud.google.com/" rel="nofollow">https://status.cloud.google.com/</a> now. Can we update the title?
This is pretty bad - on a regional cluster:<p>$ kubectl get nodes
The connection to the server XXX was refused - did you specify the right host or port?<p>Bigtable has also been unresponsive for some time now.<p>EDIT: This is us-east1. Responding again now.
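For anyone hitting the same refused connection, the usual first checks before blaming the incident look something like this (cluster name and region are placeholders for your own):
$ gcloud container clusters get-credentials my-cluster --region us-east1
$ kubectl cluster-info
$ kubectl get nodes -v=6
If refreshing credentials doesn't help and the verbose output shows the regional API endpoint itself refusing the TCP connection, it's on their side, not yours.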
Aside from GKE, chat.google.com and calendar.google.com are acting weird, while hangouts.google.com is working just fine here. What's also interesting is that the GCP dashboard shows this issue as being a few days old now.<p>EDIT: now the dashboard shows multiple services having issues, across the board.
Unfortunately, I feel like Google has one of these every 6 months, and I really hope they resolve it. I've been an App Engine user since 2008 and there are many mission-critical apps that are heavily impacted by any downtime. It usually ends up being a networking configuration issue on their end in the US East region? A strange repeating pattern.
I can't tell if it's a coincidence, but I've had all of our GCE pull-queues failing with "transient failures" for pretty much the entire time GKE has been reporting this issue.<p>Have they only _just_ realised this is affecting GCE after all this time, or has it only _just_ started to affect GCE?
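If it helps anyone, a blunt retry-with-backoff wrapper at least keeps things limping along. This is a generic sketch, not anything specific to the pull-queue API; the gcloud call at the end is just a stand-in for whatever lease call is failing:

  retry() {
    # run the given command up to 5 times, backing off 2s, 4s, 8s, 16s, 32s
    for attempt in 1 2 3 4 5; do
      "$@" && return 0
      sleep $(( 2 ** attempt ))
    done
    return 1
  }
  retry gcloud compute instances list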
[edit] Incident logged on GAE <a href="https://status.cloud.google.com/incident/appengine/19013" rel="nofollow">https://status.cloud.google.com/incident/appengine/19013</a><p>Our Google App Engine Flex app is not working. We are just getting 502 errors. Locally, everything is working fine.
The service is not working: the instance stays in a restarting state and shows the message "This VM is being restarted".<p>As per this status page, the issue was supposed to be resolved on 1st Nov:
<a href="https://status.cloud.google.com/incident/appengine/19012" rel="nofollow">https://status.cloud.google.com/incident/appengine/19012</a>
I'm really waiting for the postmortem. The first services down were networking/datastore, and some minutes later all the others started to fail. My hypothesis is that network failures prevented Paxos, a CP algorithm, from making progress, blocking writes.
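Back-of-envelope version of that hypothesis (the 5-replica group size is purely my assumption, not anything Google has published):

  replicas=5
  quorum=$(( replicas / 2 + 1 ))   # a Paxos write needs a majority of acks, here 3 of 5
  reachable=2                      # suppose the partition leaves the proposer able to see only 2 replicas, itself included
  if [ "$reachable" -ge "$quorum" ]; then echo "writes can commit"; else echo "writes blocked"; fi

If the partition leaves no side with a majority, writes stall everywhere even though each replica is individually healthy, which would explain the datastore going first and the services built on it following.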
It feels like GCP has not done a very good job of reducing the blast radius within its services. Each time there is an outage, so many downstream Google services are affected.<p>It's unbelievable that this is the second multi-region outage this year.
It’s days like this I miss having data centers to manage. At least then it was my fault when the service went down. Nowadays I have to create redundancy across two different cloud providers to keep my business running. Thanks Google!<p>For the record, the price was appealing enough for us to start moving to GCP, but an outage like this is giving me second thoughts.<p>Am I right when I hear my other sysadmin friends say GCP is like Gmail back in the day: still in beta?
I know this is no way related, but there was this other submission which I found excellent, "Taking too much slack out of the rubber band" [0], and it just made me wonder...<p>[0]:<a href="https://news.ycombinator.com/item?id=21502292" rel="nofollow">https://news.ycombinator.com/item?id=21502292</a>
Hmmm, this will be expensive.<p>I wonder what the 'lost revenue' costs will add up to. Also, I surely hope there aren't any medical/transportation/other critical things depending on this.