I personally use fdupes.pl:

http://www.perlmonks.org/?node_id=85202

Tested on many millions of files, works like a charm (though it can run out of memory on a 32-bit machine). I'm using the enhanced version here: http://www.perlmonks.org/?node_id=1099194 which has an autodelete flag and prudently ignores symlinks.
I personally found fdupes to be slower and more limited than dupfiles [0].

I switched to dupfiles about a year ago and haven't had any problems yet.

[0]: http://liw.fi/dupfiles/
I used this when I was working on a product that used automated tests to upload files repeatedly during the day. The volume of test files was so great that it continually put pressure on the storage -- more pressure than the uploads from the actual users.

Fortunately the uploads came from a set of a few dozen static files, and de-duplicating the data via fdupes cut disk usage by a factor of 20-50.
I did something similar to this a while back, called qdupe [0], written in Python. It doesn't do the deleting for you, but it is very fast at identifying duplicates if you have a lot to compare. It's based on the fastdup algorithm.

[0] https://github.com/cwilper/qdupe
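The core trick (just a rough Python sketch of the general idea, not qdupe's actual code, and I'm glossing over fastdup's details) is to bucket files by size first, so most files never need to be read at all, and then only confirm content within a bucket:

    import os
    import sys
    from collections import defaultdict

    def size_groups(root):
        # Bucket regular files by size; only same-size files can possibly be duplicates.
        groups = defaultdict(list)
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    groups[os.path.getsize(path)].append(path)
        return {size: paths for size, paths in groups.items() if len(paths) > 1}

    def same_content(a, b, chunk=64 * 1024):
        # Compare two files chunk by chunk, bailing out at the first difference.
        with open(a, "rb") as fa, open(b, "rb") as fb:
            while True:
                ca, cb = fa.read(chunk), fb.read(chunk)
                if ca != cb:
                    return False
                if not ca:
                    return True

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "."
        for size, paths in size_groups(root).items():
            # Simplification: only reports files identical to the first one in each size bucket.
            dupes = [p for p in paths[1:] if same_content(paths[0], p)]
            if dupes:
                print(size, paths[0], *dupes)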
Yeah, I wrote something similar a long time ago in Python: https://bitbucket.org/panzi/finddup/src
It's not exactly clear, but I'm assuming this is some kind of automated hard-linking utility? Or does it use its own special magic? (filesystem type restrictions?)
not nearly as fancy, but it gets the job done for me:
http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash
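If you'd rather have that in Python than a shell pipeline, the same idea (size first, then MD5 only for the size collisions) looks roughly like this (a sketch, not a faithful port of the one-liner):

    import hashlib
    import os
    import sys
    from collections import defaultdict

    def md5sum(path, chunk=1 << 20):
        # Hash a file in chunks so large files don't have to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def find_dupes(root):
        # Pass 1: bucket by size -- files of different sizes can never be duplicates.
        by_size = defaultdict(list)
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)
        # Pass 2: hash only the files whose size collides with at least one other file.
        by_hash = defaultdict(list)
        for size, paths in by_size.items():
            if len(paths) > 1:
                for path in paths:
                    by_hash[(size, md5sum(path))].append(path)
        return [paths for paths in by_hash.values() if len(paths) > 1]

    if __name__ == "__main__":
        for group in find_dupes(sys.argv[1] if len(sys.argv) > 1 else "."):
            print("\n".join(group), end="\n\n")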