Organizing Data Through the Lens of Deduplication

4 points by anishathalye almost 5 years ago

1 comment

anishathalye almost 5 years ago
Hi HN! Recently, I decided to take care of a task I had been procrastinating on for a while: organizing and de-duping the data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.

There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don't help with the work that comes after that. This was problematic (we had ~500k dupes), so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils, so e.g. a `psc ls` highlights duplicates and a `psc rm` deletes files only if they have duplicates elsewhere.

I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I'm curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.
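The core building block behind duplicate-aware commands like the `psc ls` / `psc rm` described above is grouping files by content hash, so each command can check whether a file has copies elsewhere before acting on it. The post doesn't show the tool's internals, so here is a minimal, hypothetical sketch of that idea in Python (the `find_duplicates` helper and SHA-256 choice are illustrative assumptions, not the author's implementation):

```python
#!/usr/bin/env python3
# Illustrative sketch: group files under a directory by content hash.
# Not the author's psc tool; just the general duplicate-detection idea.
import hashlib
import sys
from collections import defaultdict
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group regular files under `root` by content hash, keeping only groups with duplicates."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    for digest, paths in find_duplicates(root).items():
        print(digest[:12])
        for p in paths:
            print(f"  {p}")
```

With an index like this in hand, a duplicate-aware `rm` analog would presumably only delete a file when its hash group still contains at least one other copy; real tools typically also pre-filter by file size and hash lazily to avoid reading every byte on disk.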