We are currently using perceptual hashes (e.g. phash.org) to do hundreds of thousands of image comparisons per day.

As mentioned in another comment, you really have to test different hashing algorithms to find one that suits your needs best. In general, though, I think it is in most cases not necessary to develop an algorithm from scratch :)

For us, the much more challenging part was/is to develop a system that can find similar images quickly. If you are interested in things like that, have a look at VP/MVP data structures (e.g. http://pnylab.com/pny/papers/vptree/vptree/, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7492).
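To make the similarity-search part concrete, here is a minimal sketch of a VP-tree keyed on 64-bit perceptual hashes with Hamming distance as the metric. The random hashes, the random pivot choice, and the max_dist of 4 are purely illustrative; a production version needs smarter pivot selection and persistence.

    import random

    def hamming(a, b):
        # Hamming distance between two 64-bit integer hashes.
        return bin(a ^ b).count("1")

    class VPNode:
        def __init__(self, items):
            # Pick a random vantage point, then split the rest at the median distance.
            idx = random.randrange(len(items))
            self.point = items[idx]
            rest = items[:idx] + items[idx + 1:]
            self.radius = 0
            self.inside = self.outside = None
            if rest:
                dists = sorted(hamming(self.point, p) for p in rest)
                self.radius = dists[len(dists) // 2]
                inner = [p for p in rest if hamming(self.point, p) <= self.radius]
                outer = [p for p in rest if hamming(self.point, p) > self.radius]
                self.inside = VPNode(inner) if inner else None
                self.outside = VPNode(outer) if outer else None

        def search(self, query, max_dist, results):
            d = hamming(self.point, query)
            if d <= max_dist:
                results.append((d, self.point))
            # The triangle inequality tells us which halves can still hold matches.
            if self.inside and d - max_dist <= self.radius:
                self.inside.search(query, max_dist, results)
            if self.outside and d + max_dist >= self.radius:
                self.outside.search(query, max_dist, results)

    # Hypothetical usage: index 10,000 random 64-bit hashes, then find everything
    # within Hamming distance 4 of a slightly perturbed query.
    hashes = [random.getrandbits(64) for _ in range(10000)]
    tree = VPNode(hashes)
    matches = []
    tree.search(hashes[0] ^ 0b101, max_dist=4, results=matches)
    print(sorted(matches))

The nice property is that the triangle inequality lets the search skip whole subtrees, so lookups stay far below a full linear scan once the index holds millions of hashes.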
Here is my question for OP. It seems like the image shrinking step coupled with the transformation to +/- values loses so much information that the hash would suffer from a big false positive problem. I would have loved to see some data on this based on their own dataset.

To give a concrete example, I noticed recently that a lot of app store icons within a category look pretty similar. See for example this:

https://twitter.com/acslater00/status/450127865682489344/photo/1

All of those checkmarks certainly seem to my naked eye like the binary diff transformation would give them nearly identical hashes. It seems like the rounded corners in 'Finish' would blur out, the texture in 'OmniFocus 2' would blur out, and the gradient in 'Clear' would look identical to a flat gradient on the right side of the checkmark.

Anyway, clever algorithm, but I'm curious how it works in practice on small icons?
Or you can use OpenCV with SIFT/SURF/ORB + KNN/RANSAC and have a very robust solution. Example [1].

OpenCV has awesome Python bindings (cv2), btw.

[1] http://stackoverflow.com/questions/2146542/opencv-surf-how-to-generate-a-image-hash-fingerprint-signature-out-of-the
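Not the exact code from [1], but a rough sketch of that kind of pipeline in cv2: ORB keypoints, brute-force Hamming matching, and a RANSAC homography check. The file names and the inlier cutoff of 10 are placeholders.

    import cv2
    import numpy as np

    # Load the two images to compare (paths are placeholders).
    img1 = cv2.imread("icon_a.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("icon_b.png", cv2.IMREAD_GRAYSCALE)

    # Detect ORB keypoints and compute binary descriptors.
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force matching with Hamming distance (appropriate for ORB descriptors).
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

    # Keep only matches consistent with a single homography, estimated via RANSAC.
    inliers = 0
    if len(matches) >= 4:
        src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        inliers = int(mask.sum()) if mask is not None else 0

    print("geometrically consistent matches:", inliers)
    # Hand-tuned decision rule for the sketch: call it a duplicate above ~10 inliers.
    print("likely duplicate" if inliers > 10 else "probably different")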
This is a bad approach for a couple of reasons:

1) The total measurable space/information over a resized icon-size greyscale image is pretty small, so you run into a much higher likelihood of collisions/false positives.

2) It's not too hard to program a Haar wavelet[1] transform[2] (basically iterative scaled thresholding; a minimal version is sketched below). This has worked well over at IQDB[3], where they do reverse image lookups on databases of several million images via a modified Haar wavelet.

You can't beat this algorithm for simplicity, though. Have you guys done any false positive checks with this algorithm? The saving grace might be that icons are fairly small and limited in detail/color.

[1] http://en.wikipedia.org/wiki/Haar_wavelet

[2] http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1076507#1076507

[3] http://iqdb.org/
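For anyone who hasn't seen [1]/[2] before, here is a minimal numpy sketch of a 2D Haar decomposition (pairwise averages and differences, recursing on the low-frequency quadrant). The 128x128 size, the three levels, and keeping the 40 strongest coefficients are arbitrary choices for illustration, not IQDB's actual parameters.

    import numpy as np
    from PIL import Image

    def haar_2d(data, levels=3):
        """2D Haar transform: repeatedly split into averages and differences."""
        out = data.astype(float)
        size = out.shape[0]
        for _ in range(levels):
            block = out[:size, :size]
            # Rows: pairwise averages on the left, pairwise differences on the right.
            avg_r = (block[:, 0::2] + block[:, 1::2]) / 2.0
            dif_r = (block[:, 0::2] - block[:, 1::2]) / 2.0
            block = np.hstack([avg_r, dif_r])
            # Columns: same thing vertically.
            avg_c = (block[0::2, :] + block[1::2, :]) / 2.0
            dif_c = (block[0::2, :] - block[1::2, :]) / 2.0
            out[:size, :size] = np.vstack([avg_c, dif_c])
            size //= 2
        return out

    # Hypothetical usage: decompose a 128x128 greyscale version of the image and
    # keep only the positions of the largest coefficients as a compact signature.
    img = np.asarray(Image.open("photo.jpg").convert("L").resize((128, 128)))
    coeffs = haar_2d(img)
    flat = np.abs(coeffs).ravel()
    signature = set(np.argsort(flat)[-40:])   # indices of the 40 strongest coefficients
    print(len(signature), "coefficients kept")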
If the hash were really looked up by value, like in the example SQL:

    SELECT pk, hash, file_path FROM image_hashes
    WHERE hash = '4c8e3366c275650f';
This is screaming to be stored in a bloom filter.
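A minimal sketch of that idea: keep a Bloom filter of every stored hash in memory and only run the SELECT when the filter says the hash might exist. The filter size, the number of hash functions, and the SHA-256-derived bit positions below are illustrative, not tuned.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive k bit positions from a single SHA-256 digest of the item.
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Hypothetical usage: populate from the image_hashes table once, then use the
    # filter to skip the SELECT entirely for hashes that definitely aren't stored.
    seen = BloomFilter()
    seen.add("4c8e3366c275650f")
    if seen.might_contain("4c8e3366c275650f"):
        pass  # only now run: SELECT ... WHERE hash = '4c8e3366c275650f';
    print(seen.might_contain("ffffffffffffffff"))  # almost certainly False

Since a Bloom filter never gives false negatives, a "no" answer is definitive and the database round-trip can be skipped entirely.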
Several years ago I tried matching images with histogram comparisons. I realized it wasn't going to be a perfect match, but it actually had a pretty high degree of success on my data set (game tiles).
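For reference, a sketch of that kind of histogram matching (not the original code): normalized greyscale histograms compared with an intersection score. The 64 bins and the 0.9 cutoff are placeholders.

    import numpy as np
    from PIL import Image

    def histogram(path, bins=64):
        """Normalized greyscale histogram of an image."""
        pixels = np.asarray(Image.open(path).convert("L")).ravel()
        hist, _ = np.histogram(pixels, bins=bins, range=(0, 255))
        return hist / hist.sum()

    def similarity(h1, h2):
        """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
        return float(np.minimum(h1, h2).sum())

    score = similarity(histogram("tile_a.png"), histogram("tile_b.png"))
    print("probably the same tile" if score > 0.9 else "different tiles", score)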
I can't work out why the width would need to be 1px bigger. I don't see the explanation in #3.
Also, the array in #3 shows a 9x9 grid of 81 values when there are only 72 pixels?
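If I'm reading the post right, the extra column is there because the bits come from horizontal neighbour comparisons: 9 pixels per row give 8 comparisons, so an 8-row image yields 8x8 = 64 bits even though 9x8 = 72 pixels are sampled. A minimal dHash-style sketch under that assumption:

    from PIL import Image

    def dhash(path, hash_size=8):
        """Difference hash: shrink to (hash_size+1) x hash_size, compare horizontal neighbours."""
        # 9 columns, 8 rows: one extra column so every row yields hash_size comparisons.
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
        pixels = list(img.getdata())  # 72 greyscale values, row by row
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (left < right)
            # 8 comparisons per row -> 64 bits total
        return "%016x" % bits

    print(dhash("icon.png"))  # e.g. something shaped like '4c8e3366c275650f'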
Great write-up! We did something very similar when trying to find duplicate product images for a consumer review site we were working on. Our implementation desaturated the image, broke it into a fixed number of tiles, and generated a histogram of each tile's values. Then we put a threshold on the histogram values, and had each value in the tile represent a bit. Combine the bits, and we had a hash to store in the DB. Our hashes were a bit larger, so images within a certain hamming distance were flagged, rather than just looking for exact hash matches. It took quite a bit of tuning for us to get good results, but it seemed to work pretty well. Do you see many false positives with such a small processed image size (the 9x8 one, I mean)?
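For anyone curious, a rough sketch of the tiled-histogram scheme described above. The 4x4 grid, 4 bins per tile, the "more than its fair share" threshold, and the Hamming cutoff of 6 are placeholder guesses rather than tuned parameters.

    import numpy as np
    from PIL import Image

    def tile_histogram_hash(path, grid=4, bins=4, size=128):
        """Desaturate, split into grid x grid tiles, threshold each tile's histogram bins into bits."""
        img = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=float)
        tile = size // grid
        bits = []
        for ty in range(grid):
            for tx in range(grid):
                patch = img[ty * tile:(ty + 1) * tile, tx * tile:(tx + 1) * tile]
                hist, _ = np.histogram(patch, bins=bins, range=(0, 255))
                hist = hist / hist.sum()
                # One bit per histogram bin: set if that bin holds more than its "fair share".
                bits.extend(int(h > 1.0 / bins) for h in hist)
        return bits  # grid*grid*bins bits, 64 in this configuration

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    h1 = tile_histogram_hash("product_a.jpg")
    h2 = tile_histogram_hash("product_b.jpg")
    print("flag as duplicates" if hamming(h1, h2) <= 6 else "different", hamming(h1, h2))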
An idea: this would require more storage, but would it be possible to hash an image and take snapshots of the hashing algorithm as it processes the image, say after each block (hashing digests a block, such as 64 bytes, at a time)? Then simply compare the list of snapshots between two images and come up with a statistical threshold for a "similar image". In the case of the two cats, the images only differ in the nose, so the first half of the image up to the nose would produce the same list of snapshots.

You could also hash forwards, backwards, and starting at several random midpoints to prevent someone from simply changing the first block to throw off the hashing algorithm.
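A sketch of the snapshot idea with SHA-256 and 64-byte blocks (the block size and the prefix-based score are placeholders). One caveat: with a cryptographic hash the snapshots only stay equal while the raw bytes are byte-for-byte identical, so any re-encoding or metadata change breaks the match.

    import hashlib

    def digest_snapshots(data, block_size=64):
        """Hash the data block by block, recording the running digest after each block."""
        h = hashlib.sha256()
        snapshots = []
        for i in range(0, len(data), block_size):
            h.update(data[i:i + block_size])
            snapshots.append(h.hexdigest())  # hashlib lets you peek at the digest mid-stream
        return snapshots

    def prefix_similarity(a, b):
        """Fraction of leading blocks whose running digests agree."""
        same = 0
        for s1, s2 in zip(a, b):
            if s1 != s2:
                break
            same += 1
        return same / max(len(a), len(b), 1)

    cat1 = open("cat_original.png", "rb").read()
    cat2 = open("cat_edited_nose.png", "rb").read()
    score = prefix_similarity(digest_snapshots(cat1), digest_snapshots(cat2))
    print("similar" if score > 0.5 else "different", score)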
Will the PIL import pick up WebP images as long as libwebp-dev is installed? I'm a WebP kinda guy and I love fuzzy math achievements like this, especially when I don't have to switch to Windows[1] to take advantage of it.

[1] http://www.visipics.info
I wonder if this algorithm would work if someone uploaded a rotated version of a previously uploaded icon. As it only compares row adjacent values, if the rotation was 90 or 270 degrees I'm not sure it'd detect the duplicate.
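One cheap workaround, sketched here with the third-party imagehash package (an assumption on my part, it is not what the article uses): store the hashes of all four 90-degree rotations at insert time, then check an upload against the whole set.

    from PIL import Image
    import imagehash  # third-party package: pip install ImageHash

    def rotation_hashes(path):
        """Difference hashes of the image at 0/90/180/270 degrees."""
        img = Image.open(path)
        return {str(imagehash.dhash(img.rotate(angle, expand=True))) for angle in (0, 90, 180, 270)}

    # Store all four hashes per icon; a rotated re-upload then matches one of them.
    existing = rotation_hashes("stored_icon.png")
    uploaded = str(imagehash.dhash(Image.open("uploaded_icon.png")))
    print("possible rotated duplicate" if uploaded in existing else "no match")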
Be careful not to get sued like that guy who posted an article about detecting songs and got sued by Shazam, even though everybody knows about FFT. (I've been doing something like that for my BSc.)
For Rubyists, check out the phashion gem, which is a light wrapper around pHash:

https://github.com/westonplatter/phashion
I always wondered how partial duplicates on images were found. This wasn't as complicated as I was expecting. I assume some similar techniques are used for reverse image searching?
JPEG already has the low resolution information stored in an easily retrievable way. You could use that directly, no need to do the transforms. It would be a lot faster.
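One way to get at that from Python (a sketch; PIL's draft() lets the JPEG decoder hand back a 1/2, 1/4 or 1/8 scale decode straight from the DCT coefficients instead of decoding at full size first):

    from PIL import Image

    # Ask the JPEG decoder for a cheap, reduced-resolution decode, then shrink
    # the rest of the way down to the hash size.
    img = Image.open("photo.jpg")
    img.draft("L", (16, 16))          # hint: greyscale, roughly 16x16 is enough
    small = img.convert("L").resize((9, 8))
    print(small.size, list(small.getdata())[:9])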
Although slower than this solution, I like this program: http://ssdeep.sourceforge.net/