Total data loss after botched GitOps and failed backups

209 points, by talonx, over 1 year ago

37 comments

pech0rin, over 1 year ago
Very sorry to hear about this. Having worked in ops for 10+ years, I know that drop-kick-in-the-stomach feeling when everything is going wrong and nothing is working as expected. I hope the author keeps moving forward and takes this as a lesson.

My own two cents, which others have echoed: this is an extremely over-complex setup for the situation. It has been said 100x on HN, but there is a reason. The more complex a system is, the more points of failure you have. For all the tools you were running, you should have had a team of ops people. In reality, a database, some servers, and a load balancer would get you 99% of the way there.

Modern engineering is a dumpster fire of complexity, mostly hawked by shills working to sell contracts to enterprises.

You have learned this lesson the hard way, but fewer moving parts means fewer things that can go wrong. Basics like monitoring, backups, and testing only go so far if your system is a Rube Goldberg machine. I hope we as an industry can someday get back to simplicity.
oneepic, over 1 year ago
> I won't personally be bringing back outdoors.lgbt or firefish.lgbt. Being an admin has been one of the most fulfilling things I have done in a long time and you all have made it such an amazing experience. However, I need to take a step back. I would love to hand the domains over to someone with a similar passion for creating a safe and welcoming community.

I hope the author isn't quitting only out of a feeling of guilt or shame. If that's the case, it's certainly OK to give it another try. Tech is toxic enough depending on where you look, and that's not even counting the self-hurt some of us inflict on ourselves.
SeanAnderson, over 1 year ago
Off-topic, but this website has some pretty crazy dark patterns!

I went to the user's profile and right-clicked their photo. The context-menu action is intercepted and replaced with a custom menu mimicking the browser's context menu. This fake context menu has the option "Open in Window". I clicked it. The website doubles down on the charade and opens a "popup" inside itself, with maximize/minimize icons in the action bar and everything.

It's the sort of thing you're taught phishing websites do to lure you into entering your credentials into the wrong site.

https://i.imgur.com/lMiH3uH.png
hnarn, over 1 year ago
I'm trying very hard not to be flippant here, but I can't shake the feeling that this Kubernetes norm has to end. I'm not saying Kubernetes needs to disappear, but people need to stop treating it as the new normal, as if VMs and config management were somehow an outdated and incapable alternative.

To me, this is an example of the complexity of Kubernetes coming back to blow your foot off. Remember, Kubernetes exists to make scaling and redundancy *easier*, but it's only easier if you fully understand the implications of every configuration change you make.

Complexity causes incidents, so my mantra will always be: if you propose to introduce complexity, have a justification ready for why its inherent risks are outweighed by the benefits.

If you're deploying your side projects on infrastructure this complex, I would strongly suggest taking a step back and asking whether the same benefits couldn't be achieved in a simpler way.
gorgoiler, over 1 year ago
Kudos to the admin for writing this up and making it public.

Everyone's life will involve an irretrievable loss of something, on some scale, at some point. A friend, partner, parent, child: those are the big ones. Your home, job, your pet, a precious object (or in this case, data), or an enjoyment of something: the accumulation can be a lot to handle as time goes on, but it's also human nature, and it's important to learn to tackle grief without ignoring it.

Grief is so core to being human that, thankfully, it is an aspect of life where you can find a lot of support. Religious communities, for example, can provide a lot of help that is at the same time orthogonal to their core mission of worship. You can benefit from the former without having to engage with the latter. Don't worry about being transactional in looking for help: people will *want* to help you.
zzyzxd, over 1 year ago
Here we go again: a lot of comments saying k8s and GitOps are too complex.

IMHO, your ops team can accidentally delete your data one way or another regardless. Today it was moving files to the wrong directory in git; yesterday it could have been a human running `rm -rf` against the wrong path. The item to highlight here is the bad backup process. You can't say your data is backed up unless you actually verify successful restoration from the backup data, on a regular basis. This was true 20 years ago for your SQL database instance, and it is still true today for your k8s PVCs.

FWIW, I have been running GitOps long enough that any PR that moves files around raises the highest alerts in my head. The fundamental issue I have seen in many places is that engineers store infra code in git and call it "GitOps". When you adopt the GitOps concept, the most important thing is to train your engineers to switch to a different mental model, where your git repo is the _desired_ state of your infra: it's not the actual state of the infra, and it's not a store of imperative commands to manipulate the infra. When the desired state of your namespace is for it not to exist, your GitOps engine will try to make that happen!
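One way to back that mental model with tooling is a pre-merge guard that flags deleted or moved manifests before a pruning sync can act on them. A minimal sketch, assuming a CI job with origin/main fetched; the branch name and pathspecs are assumptions, not anything from the post:

    #!/usr/bin/env sh
    # Hypothetical pre-merge guard for a GitOps repo: fail CI when a change
    # deletes or renames manifests, forcing a human to confirm it on purpose.
    set -eu

    base="origin/main"
    removed=$(git diff --name-status --diff-filter=DR "$base"...HEAD -- '*.yaml' '*.yml')

    if [ -n "$removed" ]; then
        echo "Manifests deleted or moved in this change:"
        echo "$removed"
        echo "With pruning enabled, the GitOps engine will delete the live resources too."
        exit 1
    fi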
pdntspa, over 1 year ago
Wow... tooling that deletes a bunch of stuff because a manifest file is missing. Just... wow.

I feel for the admin here. This seems equivalent to an accidental `sudo rm -rf /`.
TekMol, over 1 year ago
I have been building and running web applications used by millions of users for over 20 years now. But reading this post, I feel like I am looking into a completely different world.

None of the following has ever crossed my path:

    - GitOps repository cleanup
    - yaml manifests that create our namespaces
    - ArgoCD
    - Helm deployments
    - Persistent Volume Claims
    - Velero
    - Restic
    - PVC block volume data
    - Vultr

I simply write data to MariaDB (a fork of MySQL) and back it up via mysqldump. I wonder if that would have been an option here too?
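For what it's worth, the mysqldump route the commenter describes might look something like the sketch below. This is not the admin's setup; the host name and the sanity-check table are placeholders, and credentials are assumed to come from ~/.my.cnf:

    #!/usr/bin/env sh
    # Dump-and-verify sketch for MariaDB/MySQL. "scratch-db" is a throwaway
    # instance used only for restore testing.
    set -eu

    stamp=$(date +%Y%m%d%H%M)

    # Consistent logical dump of every database on the instance.
    mysqldump --single-transaction --all-databases | gzip > "backup-$stamp.sql.gz"

    # The backup only counts once a restore has been exercised: load it into the
    # throwaway instance and run a basic sanity query against it.
    gunzip -c "backup-$stamp.sql.gz" | mysql --host=scratch-db
    mysql --host=scratch-db -e "SELECT COUNT(*) FROM app.users"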
simula67, over 1 year ago
> While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.

Can someone explain this? How did they test restores, if the actual restore failed to come up with data?
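For readers who haven't used Velero, the missing flag the author describes is presumably along the lines of the sketch below. This assumes a Velero 1.x install with the Restic integration enabled (newer releases renamed the flag to --default-volumes-to-fs-backup); the schedule name is a placeholder:

    # Scheduled backup every 6 hours. Without the volume flag, Velero captures
    # the Kubernetes objects (including the PVC definitions themselves) but not
    # the data inside the volumes, unless provider snapshots cover them.
    velero schedule create cluster-every-6h \
      --schedule="0 */6 * * *" \
      --default-volumes-to-restic

    # Describing a backup produced by the schedule should list per-pod volume
    # backups; their absence is exactly the gap described above.
    velero backup describe <backup-name> --details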
mkl95, over 1 year ago
> We use #Velero to capture backups of our cluster every 6 hours. From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data

This is why, for non-hobby stuff, I advocate for RDS or its counterpart on your favorite provider. Running a production DB on Kubernetes is asking for trouble unless you really, really know what you are doing.
bg24, over 1 year ago
Very sad. But also a miss on the admin's part, even if unfortunate. They did not realize during manual restore testing that the volume was not being backed up.

"Yes and also apparently no. We use #Velero to capture backups of our cluster every 6 hours. From what I had seen our backups had been running successfully. I discovered once the incident started that backups had captured everything but the Persistent Volume Claim data. While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data."
neilv, over 1 year ago
Is it all fediverse data? If so, could the data be reconstructed from copies scattered amongst caches on other parties' servers?
pbjtime, over 1 year ago
Anyone who would feel a "crippling loss" if their data disappeared must run a real-world failure test at least annually. Disconnect the storage, attach new storage, and go.
E39M5S62, over 1 year ago
Oof. I've been there. The rush of adrenaline and then just feeling completely wrecked and empty is something that sticks with you for a long time. Take care, lone admin, and don't be too hard on yourself.
scarface_74, over 1 year ago
The only data that I *really* care about are my photos and videos, and they are backed up to four different providers.

But this also made me realize how upset I would be if my blog over at micro.blog were accidentally deleted. It's more of a journal of my digital nomadding with my wife across the US than anything else.

I immediately went over and started a JSON export in bar (?) format.
birdyrooster, over 1 year ago
I don't understand the backup failure. How did you not notice your PVCs weren't being backed up? Wouldn't you realize the backups were too small or completing too fast? It seems like the root cause is people not having their work checked from first principles.
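One cheap guard against exactly that class of failure is alerting on implausible backup sizes. A rough sketch, assuming the backups land in an S3-compatible bucket reachable with the AWS CLI; the bucket name and threshold are made up:

    #!/usr/bin/env sh
    # Alert when the backup prefix holds far less data than the live volumes
    # should produce; tiny totals usually mean only metadata is being captured.
    set -eu

    bucket="s3://example-backups/velero"
    min_bytes=1000000000   # expect at least ~1 GB once volume data is included

    total=$(aws s3 ls "$bucket/" --recursive --summarize | awk '/Total Size/ {print $3}')
    total=${total:-0}

    if [ "$total" -lt "$min_bytes" ]; then
        echo "Backups total only $total bytes; volume data is probably missing."
        exit 1
    fi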
tamimio, over 1 year ago
Honestly, it shouldn't feel that bad. What was the loss here, again? A bunch of users and their posts/memes?! They can create new accounts; it should not be an issue. The real fuckups, IMO, are the ones that involve people's lives: industrial automation, robotics, autonomous vehicles, aircraft systems and whatnot. Unfortunately, you usually don't see the kind of accountability shown in the OP when things go south in those domains.
dusted, over 1 year ago
While my knee-jerk reaction is to blame bad backup practices, that's not what I believe happened here. Not entirely, at least.

The setup is too complex to easily verify and restore.

Yes, a bespoke box is annoying and "inelegant", but it also works, and if it's backed up following even 80s-tier best practices, it can be restored by anyone with a pulse.
lamontcg, over 1 year ago
This all seems way overly complicated for what could probably be a few services running on a VM or four. Why does Kubernetes/Argo/Helm need to get involved? Why couldn't the whole architecture diagram look a lot more like HN's? I feel like we've entirely lost our way with complexity.
lifty, over 1 year ago
As DevOps Borat put it: “To make error is human. To propagate error to all server in automatic way is devops.”

Sorry this happened; I'm sure it's gut wrenching.
pnw, over 1 year ago
That sucks. Everyone has been there. On my first sysadmin job I did an rm /* as root and then discovered that my backups across the network to a remote tape unit had a buffering issue and were basically useless.
devmor, over 1 year ago
Automation is wonderful for creating things, configuring things and moving things.

Automation should never "clean up". Do your cleanup manually, or this is what you get.
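In an Argo CD setup like the one described in the post, one way to apply that rule is to opt stateful resources out of pruning entirely, so a missing manifest cannot take the data with it. A sketch using Argo CD's resource-level sync-option annotation; the namespace and PVC names are placeholders, not the author's:

    # Tell Argo CD never to prune this PVC, even if its manifest disappears
    # from the GitOps repo.
    kubectl -n production annotate pvc app-data \
        argocd.argoproj.io/sync-options=Prune=false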
hyperman1, over 1 year ago
Backups and restores are hard.

Last week, I upgraded my home laptop's SSD. I decided to take out the old one, do a fresh install on the new one, and restore the backup from a Synology NAS.

The good news: the backup seems complete. It's not the first time I have migrated like this, and an older hardware upgrade went a lot less well, so I am happy.

The bad news: it was far from a trouble-free process. I succeeded in the end, but had to restore a few times because the first tries were botched, and there were some gotchas to learn.
mmcnl, over 1 year ago
Unfortunately, anything that's powerful enough to simplify complex configuration is also powerful enough to accidentally make fatal errors.
redhale, over 1 year ago
Here we go again: a lot of comments saying the giant ball of knives is too dangerous.

I work with the giant ball of knives every day, wrapped in my custom-tailored full-body Kevlar bubble, and I barely ever have any catastrophic life-threatening accidents!

The giant ball of knives is not the issue. Your carelessness when working around the giant ball of knives is the issue. Look inward.
testemailfordg2, over 1 year ago
One lesson for all of us, and something I see causing issues at work every other day: testing is often not planned, or not done against what is actually in production, but against whatever makes the tester feel "works for me". We probably need something like Docker's cure for the "works on my machine" disease, but for testing.
stym06, over 1 year ago
That sucks! Now I'm thinking the GitHub Action that runs `git pull repo && cd repo/ && rm -rf dist/ && populate.sh && git push` in my bash script will totally go rogue one day and kill everything. Any ideas, peeps?
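Not from the original script, but a hedged sketch of the same job with a few guardrails: fail fast, work in a throwaway clone (assuming the pull/cd step is effectively a fresh checkout), and only delete the one path it owns. REPO_URL is a placeholder:

    #!/usr/bin/env bash
    # Safer version of the job above, under the assumptions stated in the text.
    set -euo pipefail

    workdir=$(mktemp -d)
    trap 'rm -rf "$workdir"' EXIT

    git clone "$REPO_URL" "$workdir/repo"
    cd "$workdir/repo"

    # Refuse to continue if the checkout doesn't look like the expected repo.
    [ -f populate.sh ] || { echo "populate.sh missing; refusing to run"; exit 1; }

    rm -rf -- ./dist/
    ./populate.sh
    git push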
huksley, over 1 year ago
That's why the DB should be a permanent instance, out of reach of CI/CD.
roydivision, over 1 year ago
Props to the author for being so open and honest, and for writing it up as a cautionary tale. None of us are perfect, thankfully. The stigma around failure should be eradicated from society.
serpix, over 1 year ago
This would also mean they did not have a staging environment for catching a botched deployment. Very unfortunate, and a double disappointment to then find out the backups were missing.
futuretaint, over 1 year ago
Hey, it's a mistake the admin will never make again. Be a cup-half-full person... you don't have to worry about that data any more, either.
readthenotes1, over 1 year ago
It's odd to me when people say backups are a feature.

They are not.

Restoration, however, is.

And it has to be tested...
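In a Velero setup like the one in the post, a restore drill might look roughly like the sketch below. This is not the admin's procedure; the backup, namespace and data-path names are all placeholders, and --namespace-mappings keeps the test away from the live namespace:

    # Restore the most recent backup into a scratch namespace rather than over
    # the production one.
    velero restore create drill-test \
      --from-backup <latest-backup> \
      --namespace-mappings production:restore-drill

    # Don't trust a "Completed" status: check that the PVCs are bound and that
    # the restored volumes actually contain data.
    kubectl -n restore-drill get pvc
    kubectl -n restore-drill exec deploy/database -- du -sh /var/lib/postgresql/data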
fnord77, over 1 year ago
Something like this almost happened where I worked: an Atlantis/Terraform apply started tearing down our Cassandra cluster, which was not backed up.

We caught it in time, before it got to the volumes.
samcat116, over 1 year ago
If this was with Argo, shouldn't finalizers have prevented this?
kozzz, over 1 year ago
At least they had fun with trendy tech. Who wants to SSH into servers any more? Throw in ten different tools you barely read the documentation for, season it with some yaml you found on gist.github, and join the cool guys!
macintux, over 1 year ago
The more layers of automation we add, the more invisible points of failure. Magic is great until it isn't.

I feel their pain. The sinking feeling when you realize that data is gone and not coming back is an awful experience.
shnkr, over 1 year ago
Disclaimer: author of a competing k8s backup solution.

I personally don't think Velero is a solution for production workloads or anything serious. Only an established backup company/devs will have the expertise to implement and handle all the cases and take care of all data-loss scenarios. Ideally, the k8s authors should have stopped at providing a tool (they have snapshots, like any DB/FS) rather than writing their own backup piece. Unfortunately, many of the industry solutions are wrappers over Velero, except a few (two?); one I implemented from scratch for Commvault.