We are currently using perceptual hashes (e.g. phash.org) to do hundreds of thousands of image comparisons per day.

As mentioned in another comment, you really have to test different hashing algorithms to find one that suits your needs best. In general, though, I think it is in most cases not necessary to develop an algorithm from scratch :)

For us, the much more challenging part was/is to develop a system that can find similar images quickly. If you are interested in things like that, have a look at VP/MVP data structures (e.g. http://pnylab.com/pny/papers/vptree/vptree/, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.7492).
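To make the similarity-search part concrete, here is a minimal sketch of a VP-tree keyed on 64-bit perceptual hashes with Hamming distance as the metric. The random hashes, the random pivot choice, and the max_dist of 4 are purely illustrative; a production version needs smarter pivot selection and persistence.

    import random

    def hamming(a, b):
        # Hamming distance between two 64-bit integer hashes.
        return bin(a ^ b).count("1")

    class VPNode:
        def __init__(self, items):
            # Pick a random vantage point, then split the rest at the median distance.
            idx = random.randrange(len(items))
            self.point = items[idx]
            rest = items[:idx] + items[idx + 1:]
            self.radius = 0
            self.inside = self.outside = None
            if rest:
                dists = sorted(hamming(self.point, p) for p in rest)
                self.radius = dists[len(dists) // 2]
                inner = [p for p in rest if hamming(self.point, p) <= self.radius]
                outer = [p for p in rest if hamming(self.point, p) > self.radius]
                self.inside = VPNode(inner) if inner else None
                self.outside = VPNode(outer) if outer else None

        def search(self, query, max_dist, results):
            d = hamming(self.point, query)
            if d <= max_dist:
                results.append((d, self.point))
            # The triangle inequality tells us which halves can still hold matches.
            if self.inside and d - max_dist <= self.radius:
                self.inside.search(query, max_dist, results)
            if self.outside and d + max_dist >= self.radius:
                self.outside.search(query, max_dist, results)

    # Hypothetical usage: index 10,000 random 64-bit hashes, then find everything
    # within Hamming distance 4 of a slightly perturbed query.
    hashes = [random.getrandbits(64) for _ in range(10000)]
    tree = VPNode(hashes)
    matches = []
    tree.search(hashes[0] ^ 0b101, max_dist=4, results=matches)
    print(sorted(matches))

The nice property is that the triangle inequality lets the search skip whole subtrees, so lookups stay far below a full linear scan once the index holds millions of hashes.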
Here is my question for OP. It seems like the image shrinking step coupled with the transformation to +/- values loses so much information that the hash would suffer from a big false positive problem. I would have loved to see some data on this based on their own dataset.

To give a concrete example, I noticed recently that a lot of app store icons within a category look pretty similar. See for example this:

https://twitter.com/acslater00/status/450127865682489344/photo/1

All of those checkmarks certainly seem to my naked eye like the binary diff transformation would give them nearly identical hashes. It seems like the rounded corners in 'Finish' would blur out, the texture in 'OmniFocus 2' would blur out, and the gradient in 'Clear' would look identical to a flat gradient on the right side of the checkmark.

Anyway, clever algorithm, but I'm curious how it works in practice on small icons?
Or you can use OpenCV with SIFT/SURF/ORB + KNN/RANSAC and have a very robust solution. Example [1].

OpenCV has awesome Python bindings (cv2), btw.

[1] http://stackoverflow.com/questions/2146542/opencv-surf-how-to-generate-a-image-hash-fingerprint-signature-out-of-the
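Not the exact code from [1], but a rough sketch of that kind of pipeline in cv2: ORB keypoints, brute-force Hamming matching, and a RANSAC homography check. The file names and the inlier cutoff of 10 are placeholders.

    import cv2
    import numpy as np

    # Load the two images to compare (paths are placeholders).
    img1 = cv2.imread("icon_a.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("icon_b.png", cv2.IMREAD_GRAYSCALE)

    # Detect ORB keypoints and compute binary descriptors.
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force matching with Hamming distance (appropriate for ORB descriptors).
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

    # Keep only matches consistent with a single homography, estimated via RANSAC.
    inliers = 0
    if len(matches) >= 4:
        src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        inliers = int(mask.sum()) if mask is not None else 0

    print("geometrically consistent matches:", inliers)
    # Hand-tuned decision rule for the sketch: call it a duplicate above ~10 inliers.
    print("likely duplicate" if inliers > 10 else "probably different")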
This is a bad approach for a couple of reasons:

1) The total measurable space/information over a resized icon-size greyscale image is pretty small, so you run into a much higher likelihood of collisions/false positives.

2) It's not too hard to program a Haar wavelet[1] transform[2] (basically iterative scaled thresholding; a minimal version is sketched below). This has worked well over at IQDB[3], where they do reverse image lookups on databases of several million images via a modified Haar wavelet.

You can't beat this algorithm for simplicity, though. Have you guys done any false positive checks with this algorithm? The saving grace might be that icons are fairly small and limited in detail/color.

[1] http://en.wikipedia.org/wiki/Haar_wavelet

[2] http://stackoverflow.com/questions/1034900/near-duplicate-image-detection/1076507#1076507

[3] http://iqdb.org/
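For anyone who hasn't seen [1]/[2] before, here is a minimal numpy sketch of a 2D Haar decomposition (pairwise averages and differences, recursing on the low-frequency quadrant). The 128x128 size, the three levels, and keeping the 40 strongest coefficients are arbitrary choices for illustration, not IQDB's actual parameters.

    import numpy as np
    from PIL import Image

    def haar_2d(data, levels=3):
        """2D Haar transform: repeatedly split into averages and differences."""
        out = data.astype(float)
        size = out.shape[0]
        for _ in range(levels):
            block = out[:size, :size]
            # Rows: pairwise averages on the left, pairwise differences on the right.
            avg_r = (block[:, 0::2] + block[:, 1::2]) / 2.0
            dif_r = (block[:, 0::2] - block[:, 1::2]) / 2.0
            block = np.hstack([avg_r, dif_r])
            # Columns: same thing vertically.
            avg_c = (block[0::2, :] + block[1::2, :]) / 2.0
            dif_c = (block[0::2, :] - block[1::2, :]) / 2.0
            out[:size, :size] = np.vstack([avg_c, dif_c])
            size //= 2
        return out

    # Hypothetical usage: decompose a 128x128 greyscale version of the image and
    # keep only the positions of the largest coefficients as a compact signature.
    img = np.asarray(Image.open("photo.jpg").convert("L").resize((128, 128)))
    coeffs = haar_2d(img)
    flat = np.abs(coeffs).ravel()
    signature = set(np.argsort(flat)[-40:])   # indices of the 40 strongest coefficients
    print(len(signature), "coefficients kept")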
If the hash were really looked up by value, like in the example SQL:

    SELECT pk, hash, file_path FROM image_hashes
    WHERE hash = '4c8e3366c275650f';
This is screaming to be stored in a bloom filter.
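A minimal sketch of that idea: keep a Bloom filter of every stored hash in memory and only run the SELECT when the filter says the hash might exist. The filter size, the number of hash functions, and the SHA-256-derived bit positions below are illustrative, not tuned.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive k bit positions from a single SHA-256 digest of the item.
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    # Hypothetical usage: populate from the image_hashes table once, then use the
    # filter to skip the SELECT entirely for hashes that definitely aren't stored.
    seen = BloomFilter()
    seen.add("4c8e3366c275650f")
    if seen.might_contain("4c8e3366c275650f"):
        pass  # only now run: SELECT ... WHERE hash = '4c8e3366c275650f';
    print(seen.might_contain("ffffffffffffffff"))  # almost certainly False

Since a Bloom filter never gives false negatives, a "no" answer is definitive and the database round-trip can be skipped entirely.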
Several years ago I tried matching images with histogram comparisons. I realized it wasn't going to be a perfect match, but it actually had a pretty high degree of success on my data set (game tiles).
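For reference, a sketch of that kind of histogram matching (not the original code): normalized greyscale histograms compared with an intersection score. The 64 bins and the 0.9 cutoff are placeholders.

    import numpy as np
    from PIL import Image

    def histogram(path, bins=64):
        """Normalized greyscale histogram of an image."""
        pixels = np.asarray(Image.open(path).convert("L")).ravel()
        hist, _ = np.histogram(pixels, bins=bins, range=(0, 255))
        return hist / hist.sum()

    def similarity(h1, h2):
        """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
        return float(np.minimum(h1, h2).sum())

    score = similarity(histogram("tile_a.png"), histogram("tile_b.png"))
    print("probably the same tile" if score > 0.9 else "different tiles", score)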
I can't work out why the width would need to be 1px bigger. I don't see the explanation in #3.
Also, the array in #3 shows a 9x9 grid of 81 values when there are only 72 pixels?
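If I'm reading the post right, the extra column is there because the bits come from horizontal neighbour comparisons: 9 pixels per row give 8 comparisons, so an 8-row image yields 8x8 = 64 bits even though 9x8 = 72 pixels are sampled. A minimal dHash-style sketch under that assumption:

    from PIL import Image

    def dhash(path, hash_size=8):
        """Difference hash: shrink to (hash_size+1) x hash_size, compare horizontal neighbours."""
        # 9 columns, 8 rows: one extra column so every row yields hash_size comparisons.
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size), Image.LANCZOS)
        pixels = list(img.getdata())  # 72 greyscale values, row by row
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (left < right)
            # 8 comparisons per row -> 64 bits total
        return "%016x" % bits

    print(dhash("icon.png"))  # e.g. something shaped like '4c8e3366c275650f'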
Great write-up! We did something very similar when trying to find duplicate product images for a consumer review site we were working on. Our implementation desaturated the image, broke it into a fixed number of tiles, and generated a histogram of each tile's values. Then we put a threshold on the histogram values, and had each value in the tile represent a bit. Combine the bits, and we had a hash to store in the DB. Our hashes were a bit larger, so images within a certain hamming distance were flagged, rather than just looking for exact hash matches. It took quite a bit of tuning for us to get good results, but it seemed to work pretty well. Do you see many false positives with such a small processed image size (the 9x8 one, I mean)?
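For anyone curious, a rough sketch of the tiled-histogram scheme described above. The 4x4 grid, 4 bins per tile, the "more than its fair share" threshold, and the Hamming cutoff of 6 are placeholder guesses rather than tuned parameters.

    import numpy as np
    from PIL import Image

    def tile_histogram_hash(path, grid=4, bins=4, size=128):
        """Desaturate, split into grid x grid tiles, threshold each tile's histogram bins into bits."""
        img = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=float)
        tile = size // grid
        bits = []
        for ty in range(grid):
            for tx in range(grid):
                patch = img[ty * tile:(ty + 1) * tile, tx * tile:(tx + 1) * tile]
                hist, _ = np.histogram(patch, bins=bins, range=(0, 255))
                hist = hist / hist.sum()
                # One bit per histogram bin: set if that bin holds more than its "fair share".
                bits.extend(int(h > 1.0 / bins) for h in hist)
        return bits  # grid*grid*bins bits, 64 in this configuration

    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    h1 = tile_histogram_hash("product_a.jpg")
    h2 = tile_histogram_hash("product_b.jpg")
    print("flag as duplicates" if hamming(h1, h2) <= 6 else "different", hamming(h1, h2))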
An idea: this would require more storage, but would it be possible to hash an image and take snapshots of the hashing algorithm as it processes the image, say after each block (hashing digests a block, such as 64 bytes, at a time)? Then simply compare the list of snapshots between two images and come up with a statistical threshold for a "similar image". In the case of the two cats, the images only differ in the nose, so the first half of the image up to the nose would produce the same list of snapshots.

You could also hash forwards, backwards, and starting at several random midpoints to prevent someone from simply changing the first block to throw off the hashing algorithm.
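A sketch of the snapshot idea with SHA-256 and 64-byte blocks (the block size and the prefix-based score are placeholders). One caveat: with a cryptographic hash the snapshots only stay equal while the raw bytes are byte-for-byte identical, so any re-encoding or metadata change breaks the match.

    import hashlib

    def digest_snapshots(data, block_size=64):
        """Hash the data block by block, recording the running digest after each block."""
        h = hashlib.sha256()
        snapshots = []
        for i in range(0, len(data), block_size):
            h.update(data[i:i + block_size])
            snapshots.append(h.hexdigest())  # hashlib lets you peek at the digest mid-stream
        return snapshots

    def prefix_similarity(a, b):
        """Fraction of leading blocks whose running digests agree."""
        same = 0
        for s1, s2 in zip(a, b):
            if s1 != s2:
                break
            same += 1
        return same / max(len(a), len(b), 1)

    cat1 = open("cat_original.png", "rb").read()
    cat2 = open("cat_edited_nose.png", "rb").read()
    score = prefix_similarity(digest_snapshots(cat1), digest_snapshots(cat2))
    print("similar" if score > 0.5 else "different", score)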
Will the PIL import pick up WebP images as long as libwebp-dev is installed? I'm a WebP kinda guy and I love fuzzy math achievements like this, especially when I don't have to switch to Windows[1] to take advantage of it.

[1] http://www.visipics.info
I wonder if this algorithm would work if someone uploaded a rotated version of a previously uploaded icon. As it only compares row adjacent values, if the rotation was 90 or 270 degrees I'm not sure it'd detect the duplicate.
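One cheap workaround, sketched here with the third-party imagehash package (an assumption on my part, it is not what the article uses): store the hashes of all four 90-degree rotations at insert time, then check an upload against the whole set.

    from PIL import Image
    import imagehash  # third-party package: pip install ImageHash

    def rotation_hashes(path):
        """Difference hashes of the image at 0/90/180/270 degrees."""
        img = Image.open(path)
        return {str(imagehash.dhash(img.rotate(angle, expand=True))) for angle in (0, 90, 180, 270)}

    # Store all four hashes per icon; a rotated re-upload then matches one of them.
    existing = rotation_hashes("stored_icon.png")
    uploaded = str(imagehash.dhash(Image.open("uploaded_icon.png")))
    print("possible rotated duplicate" if uploaded in existing else "no match")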
Be careful not to get sued like that guy who posted an article about detecting songs and got sued by Shazam, even though everybody knows about FFT. (I've been doing something like that for my BSc.)
For Rubyists, check out the phashion gem, which is a light wrapper around pHash:

https://github.com/westonplatter/phashion
I always wondered how partial duplicates on images were found. This wasn't as complicated as I was expecting. I assume some similar techniques are used for reverse image searching?
JPEG already has the low resolution information stored in an easily retrievable way. You could use that directly, no need to do the transforms. It would be a lot faster.
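One way to get at that from Python (a sketch; PIL's draft() lets the JPEG decoder hand back a 1/2, 1/4 or 1/8 scale decode straight from the DCT coefficients instead of decoding at full size first):

    from PIL import Image

    # Ask the JPEG decoder for a cheap, reduced-resolution decode, then shrink
    # the rest of the way down to the hash size.
    img = Image.open("photo.jpg")
    img.draft("L", (16, 16))          # hint: greyscale, roughly 16x16 is enough
    small = img.convert("L").resize((9, 8))
    print(small.size, list(small.getdata())[:9])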
Although slower than this solution, I like this program: http://ssdeep.sourceforge.net/