科技回声 (Tech Echo)

A tech news platform built with Next.js, providing global tech news and discussion.

© 2025 科技回声. All rights reserved.

Organizing Data Through the Lens of Deduplication

4 points | by anishathalye | almost 5 years ago

1 comment

anishathalye, almost 5 years ago
Hi HN! Recently, I decided to take care of a task I had been procrastinating for a while: to organize and de-dupe data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.

There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don't help with the work that comes after that. This was problematic (we had ~500k dupes), so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils, so e.g. a `psc ls` highlights duplicates and a `psc rm` deletes files only if they have duplicates elsewhere.

I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I'm curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.
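The core of the duplicate-aware `rm` idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual `psc` implementation: it builds an index from content hash to file paths, then refuses to delete a file unless at least one other copy of its contents exists elsewhere.

```python
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Content hash of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def build_index(root):
    """Map content hash -> list of paths under root."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[file_hash(path)].append(path)
    return index

def safe_rm(path, index):
    """Delete path only if another copy of its contents exists elsewhere."""
    digest = file_hash(path)
    copies = [p for p in index.get(digest, [])
              if not os.path.samefile(p, path)]
    if not copies:
        raise RuntimeError(f"refusing to delete {path}: no duplicate elsewhere")
    os.remove(path)
    index[digest] = copies  # keep the index consistent after deletion
```

A real tool would also want to detect hard links and cache hashes (hashing ~500k files from scratch on every command would be slow), but the "only remove if a copy survives" guard is the essential safety property.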