Detecting duplicate images with Python

231 点作者 iconfinder大约 11 年前

23 条评论

herrherr大约 11 年前

We are currently using perceptual hashes (e.g. phash.org) to do hundreds of thousands of image comparisons per day.As mentioned in another comment, you really have to test different hashing algorithms to find one that suits your needs best. In general though, I think it is in most cases not necessary to develop an algorithm from scratch :)For us, the much more challenging part was/is to develop a system that can find similar images quickly. If you are interested in things like that have a look at VP/MVP data structures (e.g. <a href="http://pnylab.com/pny/papers/vptree/vptree/" rel="nofollow">http://pnylab.com/pny/papers/vptree/vptree/</a>, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7492" rel="nofollow">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7...</a>).

评论 #7522622 未加载

评论 #7522487 未加载

acslater00大约 11 年前

Here is my question for OP. It seems like the image shrinking step coupled with the transformation to +/- values loses so much information that the hash would suffer from big false positive problem. I would have loved to see some data on this based on their own dataset.To give a concrete example, I noticed recently that a lot of app store icons within a category look pretty similar. See for example this:<a href="https://twitter.com/acslater00/status/450127865682489344/photo/1" rel="nofollow">https://twitter.com/acslater00/status/450127865682489344/pho...</a>All of those checkmarks certainly seem to my naked eye like they the binary diff transformation would result in a very identical hash. It seems like the rounded corners in 'Finish' would blur out, and the texture in 'OmniFocus 2' would blur out, and the gradient in 'Clear' would look identical to a flat gradient on the right side of the checkmark.Anyway, clever algorithm but curious how it works in practice on small icons?

评论 #7528314 未加载

评论 #7523450 未加载

评论 #7523441 未加载

评论 #7525792 未加载

steeve大约 11 年前

Or you can use OpenCV with SIFT/SURF/ORB + KNN/RANSAC and have a very robust solution. Example [1].OpenCV has awesome Python bindings (cv2), btw.[1] <a href="http://stackoverflow.com/questions/2146542/opencv-surf-how-to-generate-a-image-hash-fingerprint-signature-out-of-the" rel="nofollow">http://stackoverflow.com/questions/2146542/opencv-surf-how-t...</a>

评论 #7522972 未加载

cyanoacry大约 11 年前

This is a bad approach for a couple reasons: 1) The total measurable space/information over a resized icon-size greyscale image is pretty small, so you run into a much higher likelyhood of collisions/false positives.2) It's not too hard to program a Haar wavelet[1] transform[2] (basically iterative scaled thesholding). This has worked well over at IQDB[3], where they do reverse image lookups on databases of over several million images via a modified Haar wavelet.You can't beat this algorithm for simplicity, though. Have you guys done any false positive checks with this algorithm? The saving grace might be that icons are fairly small and limited in detail/color.[1] <a href="http://en.wikipedia.org/wiki/Haar_wavelet" rel="nofollow">http://en.wikipedia.org/wiki/Haar_wavelet</a>[2] <a href="http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1076507#1076507" rel="nofollow">http://stackoverflow.com/questions/1034900/near-duplicate-im...</a>[3] <a href="http://iqdb.org/" rel="nofollow">http://iqdb.org/</a>

rplnt大约 11 年前

I feel like this is doing something that was already done better (probably), but I enjoyed the article anyways. Nicely summarized what, why, and how.

评论 #7522189 未加载

SeoxyS大约 11 年前

If the has were really looked up by value, like in the example SQL:<pre><code> SELECT pk, hash, file_path FROM image_hashes WHERE hash = '4c8e3366c275650f'; </code></pre> This is screaming to be stored in a bloom filter.

ambirex大约 11 年前

Several years ago I tried matching images on histogram comparisons, I realized it wasn't going to be a perfect match, but actually had a pretty high degree of success on my data set (game tiles).

feelstupid大约 11 年前

I can't work out why the width would be need to be 1px bigger in width? I don't see the explanation in #3. Also, the array in #3 shows a 9x9 grid of 81 values when there are only 72 pixels?

评论 #7522802 未加载

szidev大约 11 年前

Great write-up! We did something very similar when trying to find duplicate product images for a consumer review site we were working on. Our implementation desaturated the image, broke it into a fixed number of tiles, and generated a histogram of each tile's values. Then we put a threshold on the histogram values, and had each value in the tile represent a bit. Combine the bits, and we had a hash to store in the DB. Our hashes were a bit larger, so images within a certain hamming distance were flagged, rather than just looking for exact hash matches. It took quite a bit of tuning for us to get good results, but it seemed to work pretty well. Do you see many false positives with such a small processed image size (the 9x8 one, I mean)?

评论 #7527982 未加载

aleh大约 11 年前

What about images which are mirror-flipped? Unless it is a symmetrical icon that particular algorithm will consider the two highly different.

amckenna大约 11 年前

An idea: This would require more information storage, but would it be possible to hash an image and take snapshots of the hashing algorithm as it processes the image, say after each block of hashing (hashing digests a block - such as 64 bytes - at a time). Then simply compare the list of snapshots between two images and come up with a statistical threshold for a "similar image"? In the case of the two cats the images only differ in the nose, so the first half of the image up to the nose would produce the same list of snapshots.You could also hash forwards, backwards and starting at several random midpoints to prevent someone from simply changing the first block to throw off the hashing algorithm.

mtct大约 11 年前

This is a very good article on images hashing. Thanks

d0ugie大约 11 年前

Will the PIL import pick up WebP images as long as libwebp-dev is installed? I'm a WebP kinda guy and I love fuzzy math achievements like this, especially when I don't have to switch to Windows[1] to take advantage of it.[1] <a href="http://www.visipics.info" rel="nofollow">http://www.visipics.info</a>

amoonki大约 11 年前

I wonder if this algorithm would work if someone uploaded a rotated version of a previously uploaded icon. As it only compares row adjacent values, if the rotation was 90 or 270 degrees I'm not sure it'd detect the duplicate.

avaku大约 11 年前

Be careful not to get sued like that guy who posted an article about detecting songs and got sued by Shazam, even though everybody knows about FFT (I've been doing something like that for my BSc)

评论 #7529494 未加载

danso大约 11 年前

For Rubyists, check out the phashion gem, which is a light wrapper around pHash<a href="https://github.com/westonplatter/phashion" rel="nofollow">https://github.com/westonplatter/phashion</a>

izzydata大约 11 年前

I always wondered how partial duplicates on images were found. This wasn't as complicated as I was expecting. I assume some similar techniques are used for reverse image searching?

Gravityloss大约 11 年前

JPEG already has the low resolution information stored in an easily retrievable way. You could use that directly, no need to do the transforms. It would be a lot faster.

评论 #7523519 未加载

kasperset大约 11 年前

Although slower than this solution, I like this program: <a href="http://ssdeep.sourceforge.net/" rel="nofollow">http://ssdeep.sourceforge.net/</a>

thearn4大约 11 年前

I've done a fair amount of work in image processing and computer vision, but haven't read much into hashing of images. Interesting article.

rompetoto大约 11 年前

Node implementation -> <a href="https://github.com/rompetoto/dhash" rel="nofollow">https://github.com/rompetoto/dhash</a>

johndavidback大约 11 年前

This might be a stupid question - but what if you have two icons, one with the cat with a black nose, one with the cat with a gray nose?

评论 #7529141 未加载

puppetmaster3大约 11 年前

The title should be: how to detect a duplicate image.No need to say, in Java, or in Python. It can be done in any lang.