GCP Incidents

339 points by dban over 1 year ago

27 comments

hermitcrab over 1 year ago

We are a small software company (2 people) and we've also had plenty of issues with Google over the years. Mostly related to Google Adwords. For example:

https://successfulsoftware.net/2015/03/04/google-bans-hyperlinks/
https://successfulsoftware.net/2016/12/05/google-cpa-bidding-goes-wild/
https://successfulsoftware.net/2020/08/21/google-ads-can-charge-you-anything-they-like-for-a-click-on-their-partner-network/
https://successfulsoftware.net/2021/05/04/wtf-google-ads/

If Google have no interest in providing decent support to the author of the original article, who are paying megabucks to Google, what hope do small businesses like mine have?

annoyed_eng over 1 year ago

Generally, I think over the last few years, GCP has lost its way.

There was a time several years ago when they were a meaningfully better option on price / performance for compute / storage / bandwidth compared to AWS. At the time, we did detailed performance testing and cost modeling to prove this for our workload (hundreds of Compute Engine instances etc).

Support back then was also excellent. One of our early tickets was an obscure networking issue. The request was quickly escalated, then passed between engineers in different regions around the world until it was resolved. We were very impressed. It was a change on the GCP end that ended up being reverted. We quickly got to real engineers who competently worked the problem with us to resolution.

The sales team interactions were also better back then. We had a great sales rep who would quickly connect us with any internal resources we needed. The sales rep was a net positive and made our experience with GCP better.

Since then, AWS has certainly caught up and is every bit as good from a cost / performance standpoint. They remain years ahead on many managed services.

The GCP support experience has degraded significantly at this point. Most cases seem to go to outsourced providers who don’t seem able to see any data about the actual underlying GCP infrastructure. We too have detected networking issues that GCP does not acknowledge. The support folks we are dealing with don’t seem to have any greater visibility than we do. It’s pathetic and deeply frustrating. I’m sure it’s just as frustrating for them.

The sales experience is also significantly worse. Our current rep is a significant net negative.

We’ve made significant investments in GCP and we hate seeing this happen. While we would love to see things improve, we don’t see any signs of that actually happening. We are actively working to reduce our GCP spend.

A few years ago, I was a vocal GCP advocate. At this point, I’d have a hard time suggesting anyone build anything new on GCP.

supermatt over 1 year ago

No doubt all cloud providers have their problems.

For my day job, over the last 2 years we have discovered and reported multiple issues with Keyspaces, Amazon Aurora, and App Runner. In all cases these issues have resulted in performance degradation, and in AWS support wasting our time by sending us chasing our tails. After many weeks of escalation, we eventually ended up with project leads who confirmed the issues (some of which they were already aware of, yet the support teams had wasted our time anyway!), and some of the issues have since been resolved.

We are stuck with Keyspaces for the time being, but now refuse to use anything beyond the core services (EC2, EBS, S3). As soon as you venture away from those, there be dragons.

vel0city over 1 year ago

It's hilarious that people are bashing GCP for having one compute instance go down when the author acknowledges it's a rare event. On AWS I've got instances getting force-stopped or even straight up disappearing all the time. 99.95% durability vs 99.999% is way different.

If they had the same architecture on AWS it would go down all the time IME. AWS primitives are way less reliable than GCP's, according to AWS' docs and my own experiences.
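For a sense of scale, here is a rough sketch of what that gap means if the two percentages are read as yearly availability targets. That reading is an assumption (the comment calls them durability), the function name is made up for illustration, and neither figure is checked against any provider's SLA here.

```python
# Assumption: both percentages are treated as availability over a
# 365-day year; the comment above does not define them precisely.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service can be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.95, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under those assumptions, 99.95% leaves room for roughly 263 minutes of downtime a year and 99.999% for about 5, a factor of 50.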

StopHammoTime over 1 year ago

I have a lot of interaction with Google Cloud Support, mostly around their managed services. I am genuinely not over-impressed with their service, considering that at employers of similar size on AWS the support experience was always wonderful.

However, I will say that if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you. Given how rarely it occurs, it’s not a big deal to really go out of your way to reward someone with some emphatic positive feedback. I’ve had four genuinely fantastic experiences, and a message to a TAM always follows soon after. I hope more people like those I interacted with get rewarded and promoted.

363082a9-58a7 over 1 year ago

I've had an experience with GCP that involved a very enterprise-y feature breaking in a way that clearly showed the feature had never worked properly up until that point (aside from causing downtime when they tried to quietly fix it). On the call in which they were supposed to explain what happened, GCP reps proceeded to remind everyone that they were under NDA, because admitting to the above would've been a nightmare for regulated industries.

ransom1538 over 1 year ago

"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, instead of automatically coming back after failover — it didn’t. Our primary on-call engineer was alerted for this and dug in. While digging in, another box fell offline and didn’t come back"

This makes no sense. A machine restarted and you had a catastrophic failure? VMs reboot from time to time. But if you design your setup to completely destroy itself in this scenario, I don't think you will like a move to AWS, or god forbid, your own colo.

simo7 over 1 year ago

Interesting, I’m starting to think undocumented thresholds are quite common in GCP.

I experienced something similar with Cloud Run: inexplicable scaling events based on CPU utilization and concurrent requests (the two metrics that regulate scaling according to their docs).

After a lot of back and forth with their (premium) support, it turns out there are additional criteria, something related to request duration, but of course nobody was able to explain it in detail.

strstr over 1 year ago

Sounds like a genuinely frustrating experience.

Bit confused about why nested virt has anything to do with their problems given that they aren’t using virt inside the VMs. Softlocks are a generic indication of a lack of forward progress.

Same confusion with the MMIO instructions comment. If that’s about instruction emulation, not sure why it matters where it happens? It’s both slow and bound for userspace anyway. If it’s supposed to be fast it should basically never be exiting the guest, let alone be emulated.

Sounds like the author is a bit frustrated and (understandably) grasping at whatever straws they can for that most recent incident.

wg0 over 1 year ago

> In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on multiple occasions, we got frustrated. So we built our own networking stack — a resilient eBPF/IPv6 Wireguard network that now powers all our deployments. Suddenly, no more networking issues.

My understanding is that the network is a VLAN programmed via switches for VMs, so when you create a VPC, you're probably creating a VLAN.

So how can an overlay (UDP/WireGuard) be more reliable if the underlying network isn't stable?

PS: Had even 1/10th of these issues happened on AWS with such a customer, their army of solution architects would be camping in conference rooms every other week reviewing architecture, taking support engineers on calls and what not.

nomilk over 1 year ago

> In our experience, Google isn’t the place for reliable cloud compute

In the early days of cloud computing unreliability was understandable, but for Google to be frustrating its large customers in 2023 is a pretty bad look.

Curious to know if others have had similar experiences, or if the author was simply unlucky?

Kwpolska over 1 year ago

You should've migrated many months ago. If a cloud provider forces you to build your own networking or registry, you shouldn't use that cloud provider.

kgeist over 1 year ago

> We have automated systems in place to detect and resolve this. We’re notified in Discord

Isn't Discord hosted on GCP, too? If it goes down, monitoring also goes down?

rurban over 1 year ago

> In our experience, Google isn’t the place for reliable cloud compute, and it’s sure as heck not the place for reliable customer support.

Always was, always will be. For them, customers always come last.

tedd4u over 1 year ago

It sounds like if you deploy on Railway they don't automatically handle a box dying (e.g. with K8s or other) -- "half the company was called in to go through runbooks." When they move to their own hardware, how will they handle that?

doubloon over 1 year ago

"reasons why Oxide has a business #12390"

niuzeta over 1 year ago

I wonder how many of these stories it would take before they start affecting Google's bottom line. I've tinkered with GCP on small side projects, sure, but after over a decade of exposure to these stories on HN, I can never recommend GCP as a serious cloud alternative. I can't imagine I'm the only one in this boat.

lawgimenez over 1 year ago

If you go to Google’s issue tracker, you will find a lot of issues that were ignored. For example, this issue [0] that caused our ANR rate to dip.

[0] https://issuetracker.google.com/issues/230950647

esafak over 1 year ago

When they say they are moving off Google Cloud services to bare metal, where do they plan to move?

fidotron over 1 year ago

Maybe it is me but this doesn’t exactly reflect well on anyone. Isn’t the value prop of Railway not having to worry about things like this? It doesn’t matter what the problem is - you shouldn’t be passing such problems on to customers at all.

I have worked on a product that caused such a spike on Google App Engine that within 20 minutes of it going public Google were on the phone explaining their pagers all went off, and in that case resolved to temporarily bump the quota up for 48 hours while a mutual workaround was implemented. The state of Google Cloud today seems just another classic case of the trend of blaming the customer.

Sytten over 1 year ago

Wishing all the best to the Railway team, they really are building something nice. Hopefully the move to bare metal will mean price reductions for customers. I am philosophically opposed to cloud providers charging per user on top of very expensive resources, but it might just be me.

asylteltine over 1 year ago

I work at a company that spends billions on AWS, and we intentionally have minimal GCP deployments and ban compute there because of how unreliable GCP is and how awful their (outsourced) support is. GCP has excellent products but garbage operations. Who is running that clown show? It could easily have been the #2 cloud if they knew what they were doing.

testernews over 1 year ago

“We paid them multiple millions of dollars per year”

Never heard of Railway but paying this many $$$ per year should give you a dedicated support rep. But Google doesn’t do support for anything lol

Animats over 1 year ago

As someone who's into virtual worlds, and a user of Second Life, it's impressive to see how well those systems stay up. There hasn't been a total outage of Second Life in 5-10 years. Once Amazon's networking went down in a way that prevented new logins for a whole day, but existing logins remained. The 3D world, which has a lot of stuff going on even with no users around, continued to work. This is an extremely complex one-of-a-kind system, and it just keeps cranking along. It's very distributed; one region (a 256x256m square) can crash and restart without taking down its neighbors. Users see the failed region as a square hole in the ground filled with water until the server restarts, which takes about two minutes. So outages are quite graceful. It's currently hosted on AWS, but it doesn't have to be.

What fails? The associated webcrap. The Marketplace, which is just a catalog and shopping cart. The forum system, which is outsourced to Invision, seems to fail several times a month. The messaging system, which is just a lightweight social network. The billing system. The outgoing payments system. Amazon's outgoing HTTPS proxy. All of those have failed several times in the last year. Even the JIRA system conked out once.

The quality of web software is underwhelming.

tlogan over 1 year ago

All these cloud service providers have bugs and issues.

But the problem with Google is that their support seems somehow disconnected from the real world. There is support, and they do respond to chats, calls, or emails. However, it often feels like I'm talking to someone who doesn't genuinely care about my concerns or understand what I’m talking about.

Good support is hard to come by and hard to implement. So I really don't know what is missing in Google's support that exists in AWS support. Maybe it's that AWS support staff are trained to first put themselves in the customer's shoes and understand the problem from the customer's perspective.

ghusto over 1 year ago

I know AWS isn't cool or sexy, but shit works.

davidgerard over 1 year ago

HOW TO CHOOSE A CLOUD PROVIDER

* AWS: you will pay to have stuff work properly and you like having customer service
* Azure: you hate yourself, you're running Windows, or both
* Google: you're cheap enough that basic functionality is an optional extra
* Oracle: lol
* Hetzner: cheap, good service, the finest pets in the world, no cattle
