This made me think of something, so bear with me for a second. I took Andrew Ng's Machine Learning class on Coursera, and one of the things he talked about was how, given a sufficiently skewed data set, a logistic regression model might learn to just always predict 1, since the label is 1 for 98% of the examples.<p>The examples given here were all predicted to have 95%+ similarity, but given only this dataset I could just as easily believe the algorithm rates all images as reasonably similar. In other words, the author provides no counter-examples demonstrating that the algorithm can detect <i>dissimilar</i> images, such as a photo of a car and a clipart of pizza. I would expect that pair to have very low similarity, and a demonstration that the algorithm can tell similar and dissimilar images apart would make perceptual image hashing much more convincing as a viable technique.
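For concreteness, here's a minimal sketch of the counter-example test I have in mind, using the third-party Python "imagehash" library (my assumption, not the article's code; the file names are hypothetical):

    # Sanity check: dissimilar images should produce distant hashes.
    # Requires the "imagehash" and "Pillow" packages.
    from PIL import Image
    import imagehash

    car = imagehash.phash(Image.open("car_photo.jpg"))        # hypothetical file
    pizza = imagehash.phash(Image.open("pizza_clipart.png"))  # hypothetical file

    # Subtracting two ImageHash objects gives their Hamming distance.
    # For a 64-bit hash, unrelated images should land well above a
    # typical match threshold (~10 bits), not near zero.
    print(car - pizza)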
I've used this technique before for image searching; here are some tips:<p>The algorithm presented here can be summarized as "Take the low-frequency DCT components of the luminance channel, quantized down to one bit each". Hashing is a very misleading term for what you get from it: in particular, it has nothing like the uniform coverage of the output space you would expect from a hash function, as almost all real-world images fall into a very small range of hashes. Also, the bits in the hash are not all created equal: assuming you encode them in a zig-zag pattern, truncation is equivalent to a low-pass filter, so if you want to produce hashes of different sizes while matching on the same image elements, you should alter the quantization amount, not the output DCT size. (This is directly analogous to JPEG, which uses a fixed DCT size but lossily quantizes the coefficients to varying degrees for different compression quality settings.) Experiment with different quantization techniques; you don't have to treat every bit in the output the same.<p>This technique is only suitable for a small subset of the image files you will encounter in the real world: it's mostly unsuitable for anything but photographs and photograph-like images (3D renders, paintings, etc.). Many classes of image have no useful distinguishing information in the low-frequency luminance space: screenshots, pictures of text, line drawings, many kinds of graph and diagram, most clip-art-style drawings, and pictures that include layout elements in the image file (like most of the meme/Successories/demotivator format pictures you'll see on imageboards). Naturally, an image with a flat or heavily compressed luminance channel will yield no information at all.<p>You don't have to do anything special to match flipped images, as you should be using (IIRC) the DCT-II, which has the same output for a mirror image. It's also insensitive to scaling and mild translation.<p>For debugging purposes, note that the transformation is lossy but every step except quantization is easily reversible: you can undo those steps to generate a 'stereotype' image of a particular hash value and its neighbors. The matching algorithm has no more information available to it than you can see with your own eyes in the round-trip conversion, so if the round-tripped image preserves important distinguishing detail the matching will work, and if it turns the image into a meaningless grey smear you'll get tons of false positives.
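To make the above concrete, here's a rough sketch of the pipeline and the round-trip 'stereotype' trick in Python with numpy, scipy, and Pillow. The sizes, the median threshold, and the +1/-1 reconstruction are illustrative choices of mine, not the article's exact code:

    # Sketch: low-frequency DCT components of the luminance channel,
    # quantized down to one bit each.
    import numpy as np
    from PIL import Image
    from scipy.fftpack import dct, idct

    def phash_bits(path, size=32, keep=8):
        # Luminance only, downscaled to a small square.
        img = np.asarray(Image.open(path).convert("L").resize((size, size)),
                         dtype=float)
        # 2-D DCT-II; the low frequencies sit in the top-left corner.
        coeffs = dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")
        low = coeffs[:keep, :keep]
        # Quantize to one bit per coefficient: above/below the median.
        return low > np.median(low)

    def stereotype(bits, size=32):
        # Undo every step except quantization: map each bit back to a
        # crude +1/-1 coefficient, leave the high frequencies at zero,
        # and invert the DCT. This is everything the matcher can "see".
        keep = bits.shape[0]
        coeffs = np.zeros((size, size))
        coeffs[:keep, :keep] = np.where(bits, 1.0, -1.0)
        back = idct(idct(coeffs, axis=1, norm="ortho"), axis=0, norm="ortho")
        back -= back.min()
        scaled = back * 255.0 / max(back.max(), 1e-9)
        return Image.fromarray(scaled.astype(np.uint8))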
Is it different from existing implementations? Am I missing something?<p><a href="http://phash.org/" rel="nofollow">http://phash.org/</a><p><a href="http://hzqtc.github.io/2013/04/image-duplication-detection.html" rel="nofollow">http://hzqtc.github.io/2013/04/image-duplication-detection.h...</a><p><a href="http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html" rel="nofollow">http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-I...</a>
Hash functions impose practical mappings between unlike domains: when volatile (~1-to-1), they identify (crypto); when correlated (~n-to-n), they perceive (AI).<p>(Tweet-length summary of the article. I'm becoming increasingly fascinated by hash functions, and I find this tension between abstraction/correlation/perception and groundedness/volatility/identification really important.)<p><a href="https://twitter.com/elzr/status/497893639000190976" rel="nofollow">https://twitter.com/elzr/status/497893639000190976</a>
Also look at Semantic Hashing, which uses deep learning to compress an input (e.g. a document) into a low-dimensional bit vector with fast lookup:<p><a href="http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf" rel="nofollow">http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pd...</a><p>The idea is that you could hash one billion objects and do a very fast lookup of similar objects.<p>There is later academic work that refines this approach; for example, here is work applying it to image search: <a href="http://www.cs.utexas.edu/~grauman/temp/GraumanFergus_Hashing_chapterdraft.pdf" rel="nofollow">http://www.cs.utexas.edu/~grauman/temp/GraumanFergus_Hashing...</a><p>You can consider this a variant of Locality-Sensitive Hashing, where the hash function is specifically induced through a machine learning technique rather than chosen a priori.
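For intuition, here's a minimal random-hyperplane LSH sketch (the classic non-learned baseline): vectors that are close in cosine similarity get bit vectors with small Hamming distance, and semantic hashing in effect learns the projection instead of drawing it at random:

    # Random-hyperplane LSH over dense vectors; sizes are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    n_bits, dim = 64, 300
    planes = rng.standard_normal((n_bits, dim))

    def lsh_hash(vec):
        # One bit per hyperplane: which side of it the vector falls on.
        return (planes @ vec) > 0

    a = rng.standard_normal(dim)
    b = a + 0.1 * rng.standard_normal(dim)   # small perturbation of a
    # Similar inputs disagree on few bits; unrelated ones on ~half.
    print(np.count_nonzero(lsh_hash(a) != lsh_hash(b)))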
Does the term "perceptual hashing" only apply to the domain of images? What would you call this concept applied to text files? I am familiar with Levenshtein distance, but that family of algorithms provides a similarity score between two inputs, whereas I'd like a hash as the output.
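One locality-sensitive answer for text is Charikar's SimHash, widely used for near-duplicate detection; here's a toy Python sketch (the word-level features are deliberately naive):

    # Toy SimHash: similar texts get hashes with small Hamming distance.
    import hashlib

    def simhash(text, bits=64):
        counts = [0] * bits
        for word in text.lower().split():
            h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
            for i in range(bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        # Majority vote per bit position across all features.
        return sum(1 << i for i in range(bits) if counts[i] > 0)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    print(hamming(simhash("the quick brown fox"),
                  simhash("the quick brown foxes")))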
How would an adversary break this?<p>My first idea would be to rotate the image by something halfway between the steps used, say 360/32 degrees. How does that compare?<p>Also, because it discards the high-frequency data, one should be able to construct something like this: <a href="http://cvcl.mit.edu/hybrid/CatDogHybrid.jpg" rel="nofollow">http://cvcl.mit.edu/hybrid/CatDogHybrid.jpg</a>, where to us at a close distance it looks like one thing, but to this algorithm it looks like something else.
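A quick way to probe the half-step rotation idea empirically, using the third-party "imagehash" library as a stand-in for the article's matcher (the file name is hypothetical):

    # Hash the original against a copy rotated by 360/32 degrees.
    from PIL import Image
    import imagehash

    img = Image.open("photo.jpg")   # hypothetical file
    # Note: the corners fill with black, which itself perturbs the hash.
    rotated = img.rotate(360 / 32)

    print(imagehash.phash(img) - imagehash.phash(rotated))  # Hamming distance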
If this sort of thing fascinates you, here's a related approach: <a href="http://grail.cs.washington.edu/projects/query/" rel="nofollow">http://grail.cs.washington.edu/projects/query/</a> ... For some reason numerous people were working on image search features like this (i.e. draw what you want to find) ten years ago, but they don't seem to have caught on anywhere other than Google Images.
Very interesting approach.<p>I have worked with image feature extraction in the past. Although DCT coefficients have been used to analyze texture features, the idea of generating the hash from them (step 5) seems to be new.<p>I am curious, however, why you are discarding the color information. For reverse image search, that kind of information can usually be quite useful.
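If you did want to keep the chroma, one simple extension (my own sketch, not the article's method) would be to hash the luma and chroma channels separately and combine the distances:

    # Sketch: per-channel perceptual hashes; assumes "imagehash".
    from PIL import Image
    import imagehash

    def color_phash(path):
        y, cb, cr = Image.open(path).convert("YCbCr").split()
        return [imagehash.phash(c) for c in (y, cb, cr)]

    def distance(h1, h2):
        # Total Hamming distance across channels; down-weighting chroma
        # is a reasonable design choice since luma carries most detail.
        return sum(a - b for a, b in zip(h1, h2))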
This seems to work well for modifications that are basically in-place and high-frequency, but how well does it handle things like one image being a crop of another, or translation, or flipping?
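An easy empirical check, again using the third-party "imagehash" library as a stand-in (the file name is hypothetical):

    # Compare the hash of a full image against a cropped copy.
    from PIL import Image
    import imagehash

    img = Image.open("photo.jpg")               # hypothetical file
    w, h = img.size
    crop = img.crop((w // 10, h // 10, w, h))   # drop 10% from top/left

    # A crop shifts every DCT coefficient, so expect a large distance;
    # mild translation within the same frame is far more forgiving.
    print(imagehash.phash(img) - imagehash.phash(crop))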
Hashes are supposed to change drastically with small changes in the input. I wouldn't call this hashing but fingerprinting, and there are plenty of fingerprinting algorithms for images, some related to steganography.