科技回声 (Tech Echo)

A tech news platform built with Next.js, providing global tech news and discussion.

© 2025 科技回声. All rights reserved.

Organizing Data Through the Lens of Deduplication

4 points | by anishathalye | almost 5 years ago

1 comment

anishathalye, almost 5 years ago
Hi HN! Recently, I decided to take care of a task I had been procrastinating for a while: to organize and de-dupe data on our home file server. I was thinking of it as a mundane task that needed to get done at some point, but the problem turned out to be a bit more interesting than I initially thought.

There are tons of programs out there designed to find dupes, but most just spit out a huge list of duplicates and don't help with the work that comes after that. This was problematic (we had ~500k dupes), so I wrote a small program to help me. The approach, at a high level, is to provide duplicate-aware analogs of coreutils, so e.g. a `psc ls` highlights duplicates and a `psc rm` deletes files only if they have duplicates elsewhere.

I thought it was a somewhat interesting problem and solution, so I wrote a little write-up of the experience. I'm curious to hear if any of you have faced similar problems, and how exactly you approached organizing/de-duping data.
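The core of the duplicate-aware `rm` idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual `psc` implementation: it builds an index from content hash to file paths, then refuses to delete a file unless at least one other copy of its contents exists elsewhere.

```python
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=1 << 20):
    """Content hash of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def build_index(root):
    """Map content hash -> list of paths under root."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            index[file_hash(path)].append(path)
    return index

def safe_rm(path, index):
    """Delete path only if another copy of its contents exists elsewhere."""
    digest = file_hash(path)
    copies = [p for p in index.get(digest, [])
              if not os.path.samefile(p, path)]
    if not copies:
        raise RuntimeError(f"refusing to delete {path}: no duplicate elsewhere")
    os.remove(path)
    index[digest] = copies  # keep the index consistent after deletion
```

A real tool would also want to detect hard links and cache hashes (hashing ~500k files from scratch on every command would be slow), but the "only remove if a copy survives" guard is the essential safety property.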