We are a small software company (2 people) and we've also had plenty of issues with Google over the years. Mostly related to Google Adwords. For example:<p><a href="https://successfulsoftware.net/2015/03/04/google-bans-hyperlinks/" rel="nofollow noreferrer">https://successfulsoftware.net/2015/03/04/google-bans-hyperl...</a><p><a href="https://successfulsoftware.net/2016/12/05/google-cpa-bidding-goes-wild/" rel="nofollow noreferrer">https://successfulsoftware.net/2016/12/05/google-cpa-bidding...</a><p><a href="https://successfulsoftware.net/2020/08/21/google-ads-can-charge-you-anything-they-like-for-a-click-on-their-partner-network/" rel="nofollow noreferrer">https://successfulsoftware.net/2020/08/21/google-ads-can-cha...</a><p><a href="https://successfulsoftware.net/2021/05/04/wtf-google-ads/" rel="nofollow noreferrer">https://successfulsoftware.net/2021/05/04/wtf-google-ads/</a><p>If Google have no interest in providing decent support to the author of the original article, who are paying megabucks to Google, what hope do small businesses like mine have?
Generally, I think over the last few years, GCP has lost its way.<p>There was a time several years ago when they were a meaningfully better option when looking at price / performance for compute / storage / bandwidth when compared to AWS. At the time, we did detailed performance testing and cost modeling to prove this for our workload (hundreds of Compute Engine instances etc).<p>Support back then was also excellent. One of our early tickets was an obscure networking issue. The request was quickly escalated then passed from engineers in different regions around the world until it was resolved. We were very impressed. It was a change on the GCP end that ended up being reverted. We quickly got to real engineers who competently worked the problem with us to resolution.<p>The sales team interactions were also better back then. We had a great sales rep who would quickly connect us with any internal resources we needed. The sales rep was a net positive and made our experience with GCP better.<p>Since then, AWS has certainly caught up and is every bit as good from a cost / performance standpoint. They remain years ahead on many managed services.<p>The GCP support experience has degraded significantly at this point. Most cases seem to go to outsourced providers who don’t seem able to see any data about the actual underlying GCP infrastructure. We too have detected networking issues that GCP does not acknowledge. The support folks we are dealing with don’t seem to have any greater visibility than we do. It’s pathetic and deeply frustrating. I’m sure it’s just as frustrating for them.<p>The sales experience is also significantly worse. Our current rep is a significant net negative.<p>We’ve made significant investments in GCP and we hate seeing this happen. While we would love to see things improve, we don’t see any signs of that actually happening. We are actively working to reduce our GCP spend.<p>A few years ago, I was a vocal GCP advocate.
At this point, I’d have a hard time suggesting anyone build anything new on GCP.
No doubt all cloud providers have their problems.<p>For my day job, over the last 2 years we have discovered and reported multiple issues with Keyspaces, Amazon Aurora, and App Runner. In all cases these issues have resulted in performance degradation, and in AWS support wasting our time by sending us chasing our tails. After many weeks of escalation, we eventually ended up with project leads who confirmed the issues (some of which they were already aware of, yet the support teams had wasted our time anyway!), and some of the issues have since been resolved.<p>We are stuck with Keyspaces for the time being, but now refuse to use anything outside the core services (EC2, EBS, S3). As soon as you venture away from those, there be dragons.
It's hilarious people are bashing GCP for having one compute instance go down when the author acknowledges it's a rare event. On AWS I've got instances getting force-stopped or even straight-up disappearing all the time. 99.95% durability vs 99.999% is way different.<p>If they had the same architecture on AWS it would go down all the time IME. AWS primitives are way less reliable than GCP, according to AWS' docs and my own experiences.
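To put the two percentages in perspective, here is a rough sketch of the arithmetic. It assumes both figures are read as yearly uptime targets (the comment says "durability" and does not say which service each number belongs to), and the function name is made up for illustration.

```python
# Assumption: the percentages are treated as yearly uptime targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes per year that may be lost while still meeting the given target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for pct in (99.95, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes/year")
```

Under that reading, 99.95% allows a bit under four and a half hours per year, while 99.999% allows only a few minutes, so the gap is roughly a factor of 50.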
I have a lot of interaction with Google Cloud Support, mostly around their managed services. I am genuinely not over-impressed with their service, considering that at employers of similar size on AWS the support experience was always wonderful.<p>However, I will say if you are on Google Cloud and you have a positive interaction, make a big deal about someone helping you. Given how rarely it occurs, it’s not a big ask to go out of your way to reward someone with some emphatic positive feedback. I’ve had four genuinely fantastic experiences and there’s always a message to a TAM that follows soon after. I hope more people like those I interacted with get rewarded and promoted.
I've had an experience with GCP that involved a very enterprise-y feature breaking in a way that clearly showed the feature had never worked properly up until that point (aside from causing downtime when they tried to quietly fix it). On the call in which they were supposed to explain what happened, GCP reps proceeded to remind everyone that they were under NDA, because admitting to the above would've been a nightmare for regulated industries.
"On December 1st, at 8:52am PST, a box dropped offline; inaccessible. And then, instead of automatically coming back after failover — it didn’t. Our primary on-call engineer was alerted for this and dug in. While digging in, another box fell offline and didn’t come back"<p>This makes no sense. A machine restarted and you had catastrophic failure? VMs reboot from time to time. But if you design your setup to completely destroy itself in this scenario, I don't think you will like a move to AWS, or god forbid, your own colo.
Interesting, I’m starting to think undocumented thresholds are quite common in GCP.<p>I experienced something similar with Cloud Run: inexplicable scaling events based on CPU utilization and concurrent requests (the two metrics that regulate scaling according to their docs).<p>After a <i>lot</i> of back and forth with their (premium) support it turns out there are additional criteria, something related to request duration, but of course nobody was able to explain in detail.
Sounds like a genuinely frustrating experience.<p>Bit confused about why nested virt has anything to do with their problems given that they aren’t using virt inside the VMs. Softlocks are a generic indication of a lack of forward progress.<p>Same confusion with the MMIO instructions comment. If that’s about instruction emulation, not sure why it matters where it happens? It’s both slow and bound for userspace anyway. If it’s supposed to be fast it should basically never be exiting the guest, let alone be emulated.<p>Sounds like the author is a bit frustrated and (understandably) grasping at whatever straws they can for that most recent incident.
> In 2022, we experienced continual networking blips from Google’s cloud products. After escalating to Google on multiple occasions, we got frustrated. So we built our own networking stack — a resilient eBPF/IPv6 Wireguard network that now powers all our deployments. Suddenly, no more networking issues.<p>My understanding is that the network is a VLAN programmed via switches for VMs, so when you create a VPC, you're probably creating a VLAN.<p>So how can an overlay (UDP/WireGuard) be more reliable if the underlying network isn't stable?<p>PS: Had even 1/10th of these issues happened on AWS with such a customer, their army of solution architects would be camping in conference rooms every other week reviewing architecture, pulling support engineers onto calls and whatnot.
> In our experience, Google isn’t the place for reliable cloud compute<p>In the early days of cloud computing unreliability was understandable, but for Google to be frustrating its large customers in 2023 is a pretty bad look.<p>Curious to know if others have had similar experiences, or if the author was simply unlucky?
You should've migrated many months ago. If a cloud provider forces you to build your own networking or registry, you shouldn't use that cloud provider.
>We have automated systems in place to detect and resolve this. We’re notified in Discord<p>Isn't Discord hosted on GCP, too? If it goes down, monitoring also goes down?
> In our experience, Google isn’t the place for reliable cloud compute, and it’s sure as heck not the place for reliable customer support.<p>Always was, always will be. For them, customers always come last.
It sounds like if you deploy on Railway they don't automatically handle a box dying (e.g. with K8s or other) -- "half the company was called in to go through runbooks." When they move to their own hardware, how will they handle that?
I wonder how many of these stories it would take before it starts affecting Google's bottom line. I've tinkered with GCP on small side projects, sure - but after more than a decade of exposure to these stories on HN, I can never recommend GCP as a serious cloud alternative. I can't imagine I'm the only one in this boat.
If you go to Google’s issue tracker, you will find a lot of issues that were ignored. For example, this issue [0] that caused our ANR rate to dip.<p>[0] <a href="https://issuetracker.google.com/issues/230950647" rel="nofollow noreferrer">https://issuetracker.google.com/issues/230950647</a>
Maybe it is me but this doesn’t exactly reflect well on anyone. Isn’t the value prop of Railway not having to worry about things like this? It doesn’t matter what the problem is - you shouldn’t be passing such problems on to customers at all.<p>I have worked on a product that caused such a spike on Google App Engine that within 20 minutes of it going public Google were on the phone explaining their pagers had all gone off, and in that case they resolved to temporarily bump the quota up for 48 hours while a mutual workaround was implemented. The state of Google Cloud today seems just another classic case of the trend of blaming the customer.
Wishing all the best to the Railway team, they really are building something nice. Hopefully the move to bare metal will mean price reductions for customers. I am philosophically opposed to cloud providers charging per user on top of very expensive resources but it might just be me.
I work at a company that spends <i>billions</i> on AWS and we intentionally have minimal GCP deployments and ban compute there because of how unreliable GCP is and how awful (outsourced) their support is. GCP has excellent products but garbage operations. Who is running that clown show? It could have easily been the #2 cloud if they knew what they were doing.
“We paid them multiple millions of dollars per year”<p>Never heard of Railway but paying this many $$$ per year should give you a dedicated support rep. But Google doesn’t do support for anything lol
As someone who's into virtual worlds, and a user of Second Life, it's impressive to see how well those systems stay up. There hasn't been a total outage of Second Life in 5-10 years. Once Amazon's networking went down in a way that prevented new logins for a whole day, but existing logins remained. The 3D world, which has a lot of stuff going on even with no users around, continued to work. This is an extremely complex one of a kind system, and it just keeps cranking along. It's very distributed; one region (a 256x256m square) can crash and restart without taking down its neighbors. Users see the failed region as a square hole in the ground filled with water until the server restarts, which takes about two minutes. So outages are quite graceful. It's currently hosted on AWS, but it doesn't have to be.<p>What fails? The associated webcrap. The Marketplace, which is just a catalog and shopping cart. The forum system, which is outsourced to Invision, seems to fail several times a month. The messaging system, which is just a lightweight social network. The billing system. The outgoing payments system. Amazon's outgoing HTTPS proxy. All of those have failed several times in the last year. Even the JIRA system conked out once.<p>The quality of web software is underwhelming.
All these cloud service providers have bugs and issues.<p>But the problem with Google is that their support seems somehow disconnected from the real world. There is support, and they do respond to chats, calls, or emails. However, it often feels like I'm talking to someone who doesn't genuinely care about my concerns or understand what I’m talking about.<p>Good support is hard to come by and hard to implement. So I really don't know what is missing in Google's support that exists in AWS support. Maybe it's that AWS support staff are trained to first put themselves in the customer's shoes and understand the problem from the customer's perspective.
HOW TO CHOOSE A CLOUD PROVIDER<p>* AWS: you will pay to have stuff work properly and you like having customer service<p>* Azure: you hate yourself, you're running Windows or both<p>* Google: you're cheap enough that basic functionality is an optional extra<p>* Oracle: lol<p>* Hetzner: cheap, good service, the finest pets in the world, no cattle