A bit of context: Microsoft developed PhotoDNA to identify illegal images like CSAM -- NCMEC maintains a database of PhotoDNA signatures, and many companies use this service to identify and remove these images.<p>Microsoft claims:<p>> A PhotoDNA hash is not reversible, and therefore cannot be used to recreate an image.<p>This project shows that this isn't quite true: machine learning can do a pretty good job of reproducing thumbnail-quality images from a PhotoDNA signature.<p>There has been some discussion in the past on HN about PhotoDNA: <a href="https://news.ycombinator.com/item?id=28378254" rel="nofollow">https://news.ycombinator.com/item?id=28378254</a>. It has been claimed that PhotoDNA is reversible, but there was no public demonstration as far as I know.
On a side note, I find it kind of funny how, when using the model trained on Reddit, some of the outputs contain a quite readable "The image you are requesting does not exist or is no longer available" text, and a faint "imgur.com" watermark in the lower left corner.<p>For the former, I guess when training the original model, a bunch of the Reddit images weren't available at crawl time. Wouldn't it make sense to somehow weed those out from the data set before the training?
I'd say that the project <i>confirms</i> that PhotoDNA is not reversible.<p>This project generates discolored, deformed thumbnails with maybe 12 pixels of resolution, and that's after the addition of synthesized/imaginary data to them. Unless you have been primed by looking at the ground truth image, any attempt to guess what was in the images is just a Rorschach test.
I'm not a mathematician, but isn't there a direct correlation between reversibility and the unlikelihood of collisions? That is, if you have few to no collisions in the entire dataset of human-created images, it must be technically possible to reverse the hash into a reasonable thumbnail?
The requirement that changing the image a little bit changes the hash a little bit makes the image space smooth and more suitable for machine learning.
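To make the smoothness point concrete, here is a minimal sketch. It does not implement PhotoDNA; it uses a toy "average hash" (one bit per pixel, set when the pixel is brighter than the mean) as a stand-in for a perceptual hash, and compares it against SHA-256 on the same small edit. The synthetic 16x16 image, the size of the edit, and all function names are assumptions made for illustration.

```python
import hashlib

def average_hash(pixels):
    """Toy perceptual hash (NOT PhotoDNA): bit is 1 where a pixel exceeds the mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def sha256_bits(pixels):
    """Cryptographic hash of the same pixels, expanded to a list of 256 bits."""
    digest = hashlib.sha256(bytes(pixels)).digest()
    return [(byte >> shift) & 1 for byte in digest for shift in range(8)]

def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Synthetic 16x16 grayscale image (values 0..255), flattened to 256 pixels.
image = [(i * 37) % 256 for i in range(256)]
# Small edit: brighten the first 16 pixels by 10, clipped to the valid range.
edited = [min(255, p + 10) if i < 16 else p for i, p in enumerate(image)]

perceptual_distance = hamming(average_hash(image), average_hash(edited))
crypto_distance = hamming(sha256_bits(image), sha256_bits(edited))

print("toy perceptual hash, bits changed:", perceptual_distance, "of 256")
print("SHA-256, bits changed:", crypto_distance, "of 256")
```

The toy hash barely moves under the edit, while SHA-256 changes roughly half of its bits. A model trained on (image, hash) pairs gets a usable signal from the first kind of hash and essentially none from the second.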
I wonder if, with a couple million passwords and their salted hashes, we can reconstruct something similar to the original password and reduce the search space somewhat.<p>I know it <i>should not</i> be possible, but, still, I’d love to play with that kind of dataset.
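A quick way to see why this should not work is a sketch like the one below, which hashes a password and two guesses with a salted key-derivation function and counts how many output bits differ. The passwords, the fixed salt, and the low iteration count are all assumptions chosen to keep the example small and reproducible; a real system would use a random per-user salt and far more iterations.

```python
import hashlib

def salted_hash(password, salt, iterations=1000):
    """PBKDF2-HMAC-SHA256 of a password; iteration count kept low for the demo."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)

def bit_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Fixed salt for reproducibility only; real salts are random per user.
salt = bytes(range(16))

target = salted_hash("correct horse", salt)
near = salted_hash("correct horsf", salt)  # guess that is one character off
far = salted_hash("zzzzzzzzzzzzz", salt)   # guess that is unrelated

near_distance = bit_distance(target, near)
far_distance = bit_distance(target, far)

print("one character off, bits differing:", near_distance, "of 256")
print("unrelated guess, bits differing:", far_distance, "of 256")
```

Both guesses should land around half of the 256 bits, so the digest gives no hint that the first guess is closer. That missing gradient is exactly what a perceptual hash like PhotoDNA provides by design, which is why the image case is learnable and the password case is not expected to be.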