I realize now that I could have given the basic principles behind each of the two libraries compared here:<p>libPuzzle: Splits the image in blocks and compute the hash based on the relationships between the adjacent blocks brightness.<p>pHash: Computes the 8x8 DCT (<a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform" rel="nofollow">http://en.wikipedia.org/wiki/Discrete_cosine_transform</a>) representation of an image (lowest frequencies of the image). It then sets the hash by comparing each of these 8×8 values to the mean DCT value (very resilient to non-structural changes in the image).<p>I updated the post with these informations
I just finished writing about distance hashing functions with a slightly different angle. I visualized the distances between a bunch of images using two different techniques, one of which was pHash (discussed in the parent article). Mine isn't quite as in-depth performance wise, but it makes for pretty pictures. Some of my work is here: <a href="http://www.josephcatrambone.com/?p=619" rel="nofollow">http://www.josephcatrambone.com/?p=619</a><p>I'm going to upload the SHA distance tonight.
libpHash is actually quite slow for what it does. I spent a fair amount of time investigating image hashing algorithms a few years ago and at that time I saw 10-20x improvement over libpHash just by implementing the similar phash algorithm described in Neil K's blog.* With Puzzle being both slower and dramatically less accurate on my body of test images. Perceptual hashing can be surprisingly lightweight -- by the end of the experiment I was really just benchmarking image loading libraries. If speed is a concern you are probably better off foregoing these libs and writing the 2-3 dozen lines of code (really!) it takes to roll your own, or better yet implementing a comparable, even more lightweight algorithm like dhash.<p>* <a href="http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html" rel="nofollow">http://www.hackerfactor.com/blog/index.php?/archives/529-Kin...</a><p>* The main difference being that libpHash applies a gaussian blur over the image, which can be made redundant by using a decent resampling algorithm.
I wrote this after spending way too much time on this problem earlier this year--nothing new, but is a fair chronology of my approach.<p><a href="http://download.picturelife.com.s3.amazonaws.com/press-kits/ImageSimilarityWhirlwind.pdf" rel="nofollow">http://download.picturelife.com.s3.amazonaws.com/press-kits/...</a>
We've been using phash for an image board for a while now and are quite happy with it. We only use it to detect reposts when someone uploads an image. It gives some false positives quite often, but that's totally okay for our use case. We specifically set it up to err on the safe side. Users are only presented with a "Are you sure your upload is not a duplicate?" message.<p>Currently we're just doing a `WHERE BIT_COUNT(images.phash ^ inputHash) < 12` in MySQL over 400k rows, which still works reasonably well (~200ms) given that it can't use an index for the XOR/BIT_COUNT operation. To my knowledge there's no way to speed up this query in MySQL, so if we continue to grow we probably have to write a small daemon that is able to search hashes more efficiently.
Is an image hash a very different beast from the hashing function used in passwords etc? In the latter we want large sensitivity to small changes, while in an image hash we want a measure sensitive to similarities.<p>Naively I'd expect image hashing to be like cross-correlation (non-linear) while password hashing can be done with shifts and modulo-2 (linear).
Be sure to enable javascript or you won't be able to see the gist with the results. (Obviously including a plain table in the post would be too much work:-\)