
Sharing details on a recent incident impacting one of our customers

315 points, by nonfamous, 12 months ago

27 comments

snewman 12 months ago

Given the level of impact that this incident caused, I am surprised that the remediations did not go deeper. They ensured that the same problem could not happen again in the same way, but that's all. So some equivalent glitch somewhere down the road could lead to a similar result (or worse; not all customers might have the same "robust and resilient architectural approach to managing risk of outage or failure").

Examples of things they could have done to systematically guard against inappropriate service termination / deletion in the future:

1. When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem.

2. Audit all deletion workflows for all services (they only mention having reviewed GCVE). Ensure that customers are notified in advance whenever any service is terminated, even if "the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool".

3. Add manual review for *any* termination of a service that is in active use, above a certain size.

Absent these broader measures, I don't find this postmortem to be in the slightest bit reassuring. Given the are-you-f*ing-kidding-me nature of the incident, I would have expected any sensible provider who takes the slightest pride in their service, or is even merely interested in protecting their reputation, to visibly go over the top in ensuring nothing like this could happen again. Instead, they've done the bare minimum. That says something bad about the culture at Google Cloud.
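To make the first suggestion above concrete, here is a minimal sketch of a pending-deletion state with a retention window, assuming a simple in-memory model; the class and method names are invented for illustration and are not Google Cloud's actual API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # illustrative retention window

@dataclass
class Service:
    name: str
    state: str = "ACTIVE"                 # ACTIVE -> PENDING_DELETE -> DELETED
    delete_after: datetime | None = None

    def terminate(self) -> None:
        # Make the service unavailable, but keep all data for the window.
        self.state = "PENDING_DELETE"
        self.delete_after = datetime.utcnow() + RETENTION

    def restore(self) -> None:
        # The customer-visible "push of a button" undo.
        if self.state != "PENDING_DELETE":
            raise ValueError("nothing to restore")
        self.state, self.delete_after = "ACTIVE", None

    def purge_if_expired(self, now: datetime) -> None:
        # Only a background job may discard data, and only after the window.
        if self.state == "PENDING_DELETE" and now >= self.delete_after:
            self.state = "DELETED"
```

The point of the window is that a mistaken termination becomes a reportable outage rather than an unrecoverable data loss.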
dekhn 12 months ago

If you're a GCP customer with a TAM, here's how to make them squirm. Ask them what protections GCP has in place, on your account, that would prevent GCP from inadvertently deleting large amounts of resources if GCP makes an administrative error.

They'll point to something that says this specific problem was alleviated (by deprecating the tool that did it, and automating more of the process), and then you can persist: we know you've fixed this specific problem. Then follow up: will a human review a large-scale deletion before the resources are actually deleted?

From what I can tell (I worked for GCP aeons ago, and have been an active user of AWS for even longer), GCP's human-based protection measures are close to non-existent, and much weaker than AWS's. Either way, it's definitely worth asking your TAM about this very real risk.
tkcranny 12 months ago

> ‘Google teams worked 24x7 over several days’

I don't know if they get what the seven means there.
tempnow987 12 months ago

Wow, I was wrong. I thought this would have been something like Terraform with a default of immediate deletion and no recovery period, or something along those lines. Still a default, but a third-party thing, and maybe someone at UniSuper testing something and mis-scoping the delete.

Crazy that it really was on the Google side. UniSuper must have been like, WHAT THE HELL?
gnabgib 12 months ago

Related stories: *UniSuper members go a week with no account access after Google Cloud misconfig* [0] (186 points, 16 days ago, 42 comments) and *Google Cloud accidentally deletes customer's account* [1] (128 points, 15 days ago, 32 comments).

[0]: https://news.ycombinator.com/item?id=40304666

[1]: https://news.ycombinator.com/item?id=40313171
foobazgt 12 months ago

Sounds like a pretty thorough review, in that they didn't stop at an investigation of the specific tool / process, but also examined the rest for any auto-deletion problems and confirmed soft-delete behavior.

They could have gone one step further by reviewing all cases of default behavior for anything that might be surprising. That said, it can be difficult to assess what is "surprising", as it's often the people who know the least about a tool/API who also rely on its defaults.
janalsncm 12 months ago

I think it stretches credulity to say that the first time such an event happened was with a multi-billion-dollar mutual fund. In other words, I'm glad UniSuper's problem was resolved, but there were probably many others that were small enough to ignore.

I can only hope this gives GCP the kick in the pants it needs.
jawns 12 months ago

> The customer's CIO and technical teams deserve praise for the speed and precision with which they executed the 24x7 recovery, working closely with Google Cloud teams.

I wonder if they just get praise in a blog post, or if the customer is now sitting on a king's ransom in Google Cloud credit.
postatic 12 months ago

UniSuper customer here in Australia. I didn't know what it was, but I kept receiving emails every day while they were trying to resolve this. I only found out from the news what had actually happened. It feels like they downplayed the whole thing as "system downtime". Imagine if something had actually happened to people's money, the billions of dollars saved in their superannuation fund.
lukeschlather 12 months ago

The initial statement on this incident was pretty misleading; it sounded like Google had just accidentally deleted an entire GCP account. Reading this writeup I'm reassured: it sounds like they only lost a region's worth of virtual machines, which is absolutely something that happens (and that I think my systems can handle without too much trouble). The original writeup made it sound like all of their GCS buckets, SQL databases, etc. in all regions were just gone, which is a different thing and something I hope Google can be trusted not to do.
walrus01 12 months ago

The idea that you could have an automated tool delete services at the end of a term for a corporate/enterprise customer of this size and scale is absolutely absurd and inexcusable, no matter whether the parameter was set correctly or incorrectly in the first place. It should go through several levels of account manager/representative/management for *manual review by a human* on the Google side before removal.
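A hedged sketch of the human-review gate this comment (and remediation 3 further up) calls for: large terminations of in-use deployments are held for a named approver instead of executing automatically. The threshold, function name, and return strings are invented for illustration.

```python
# Illustrative threshold; a real policy would be tiered by spend, size, and tenure.
REVIEW_THRESHOLD_VMS = 50

def request_termination(account_id: str, vm_count: int, in_active_use: bool,
                        approver: str | None = None) -> str:
    """Route large, in-use terminations to a human instead of executing them."""
    if in_active_use and vm_count >= REVIEW_THRESHOLD_VMS and approver is None:
        # Block the automated path until a named human signs off.
        return f"HELD_FOR_REVIEW: account {account_id} ({vm_count} VMs in active use)"
    return f"TERMINATION_SCHEDULED: account {account_id}, approved by {approver or 'policy'}"
```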
hiddencost 12 months ago

> It is not a systemic issue.

I kind of think the opposite. The culture that kept these kinds of problems at bay has largely left the company or stopped trying to keep it alive, as they no longer really care about what they're building.

Morale is real bad.
kjellsbells 12 months ago

Interesting, but I draw different lessons from the post.

Use of internal tools. Sure, everyone has internal tools, but if you are doing customer-affecting work, you really ought to be using the same API surface as the public tooling, which at cloud scale is guaranteed to have been exercised and tested far more than some little dev group's scripts. Was that the case here?

Passive voice. This post should have a name attached to it. Like, Thomas Kurian. Palming it off to the anonymous "customer support team" still shows a lack of understanding of how trust is maintained with customers.

The recovery seems to have been due to exceptional good fortune or foresight on the part of the customer, not Google. It seems that the customer had images or data stored outside of GCP. How many of us cloud users could say that? How many of us have encouraged customers to move further and deeper along the IaaS > PaaS > SaaS curve, making them more vulnerable to total account loss like this? There's an uncomfortable lesson here.
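To illustrate the first point, a minimal sketch of routing internal automation through the same surface customers exercise, rather than a side-channel script. The client class and method names are invented for the sketch and are not GCP's real SDK.

```python
class PublicCloudClient:
    """Stand-in for the published client that external customers use daily."""

    def delete_deployment(self, deployment_id: str, *, confirm: bool) -> None:
        if not confirm:
            raise ValueError("explicit confirmation required")
        print(f"deleting {deployment_id} via the public, audited code path")

def internal_cleanup(deployment_id: str) -> None:
    # Internal automation reuses the public client rather than a bespoke
    # script, so it inherits the same validation, auditing, and test coverage.
    PublicCloudClient().delete_deployment(deployment_id, confirm=True)

internal_cleanup("deploy-1234")
```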
JCM9 12 months ago

The quality and rigor of GCP's engineering is not even remotely close to that of AWS or Azure, and this incident shows it.
cebert 12 months ago

> "Google Cloud continues to have the most resilient and stable cloud infrastructure in the world."

I don't think GCP has that reputation compared to AWS or Azure. They aren't at the same level.
jwnin 12 months ago

End-of-day Friday disclosure before a long holiday weekend; well timed.
lopkeny12ko 12 months ago

> Google Cloud services have strong safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate.

I mean, clearly not? By Google's own admission, in this very article, the resources were not soft-deleted, no advance notification was sent, and there was no human in the loop to approve the automated deletion.

And Google's remediation items include adding even *more* automation for this process. This sounds totally backward to me. Am I missing something?
xyst 12 months ago

Transparency, for Google, is releasing this incident report on the Friday of a long weekend [in the US].

I wonder if UniSuper was compensated for G's fuckup.

"A single default parameter vs. a multibillion-dollar organization. The winner may surprise you!1"
l00tr 12 months ago

If it were a small or medium business, Google wouldn't even care.
sgt101 12 months ago

Super motivating to have off-cloud backup strategies...
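For what that can look like in practice, a minimal sketch assuming nightly export dumps land in a local staging directory and a second provider (or on-prem target) is mounted under separate credentials; all paths are illustrative.

```python
import hashlib
import shutil
from pathlib import Path

STAGING = Path("/var/backups/nightly")          # exports produced in the primary cloud
OFFSITE = Path("/mnt/second-provider/nightly")  # separate provider, separate credentials

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate_offsite() -> None:
    OFFSITE.mkdir(parents=True, exist_ok=True)
    for dump in sorted(STAGING.glob("*.tar.gz")):
        target = OFFSITE / dump.name
        shutil.copy2(dump, target)
        # Verify the copy so a silent truncation doesn't go unnoticed.
        if sha256(dump) != sha256(target):
            raise IOError(f"checksum mismatch for {dump.name}")
```

The key property is that the copy lives under an account the primary provider cannot delete.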
nurettin 12 months ago

It sounds like a giant PR piece about how Google is ready to respond to a single customer and work through their problems, instead of creating an auto-response account-suspension infinite-loop nightmare.
maximinus_thrax 12 months ago

> Google Cloud continues to have the most resilient and stable cloud infrastructure in the world.

As a company, Google has a lot of work to do on its customer-care reputation, regardless of what some metrics somewhere say about whose cloud is more reliable. I would not trust my business to Google Cloud; I would not trust anything involving money to anything with the Google logo. Anyone who has been reading Hacker News for a couple of years can remember how many times folks were asking for insider contacts to recover their accounts/data. Extrapolating this to a business would keep me up at night.
mannyv 12 months ago

I guessed it was provisioning or keys. Looks like I was somewhat correct!
logrot 12 months ago

Executive summary?
emmelaich 12 months ago

Using "TL;DR" in professional communication is a little unprofessional.

Some non-nerd exec is going to wonder what the heck that means.
noncoml 12 months ago

What surprises me the most is that the customer managed to actually speak to a person from Google support. It must have been a pretty big private-cloud deployment.

Edit: saw from the other replies that the customer was UniSuper. No wonder they managed to speak to an actual person.
mercurialsolo 12 months ago

If only internal tools went through the same scrutiny as public tools.

More often than not, critical parameters or misconfigurations slip through because internal tools operate on unpublished params.

Internal tools should be treated as tech debt. You won't be able to eliminate issues, but you can vastly reduce the surface area for errors.
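In that spirit, a small sketch of the guard an internal provisioning tool could apply: fail loudly on a blank critical parameter rather than fall back to a destructive default. The parameter name term_days is hypothetical, not the actual field from Google's internal tool.

```python
def provision_private_cloud(customer: str, term_days: str | None) -> dict:
    """Refuse to provision when a critical parameter is blank, instead of
    silently applying a destructive default (e.g. delete after a fixed term)."""
    if term_days is None or not term_days.strip():
        # The incident's failure mode: a blank field silently became a
        # fixed-term deployment slated for deletion. Reject it instead.
        raise ValueError(f"term_days must be set explicitly for {customer}")
    return {"customer": customer, "term_days": int(term_days)}
```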