I wrote something similar several years ago on minhashing for near-duplicate detection.<p><a href="https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a" rel="nofollow">https://medium.com/@jonathankoren/near-duplicate-detection-b...</a>
Simhashing is a technique for characterizing how similar pieces of data are. The author begins with the idea that we can discard the leading characters of { aaarock, aabjeep, aaareep } and compare what remains, which picks out the latter two as the most similar pair, and concludes by computing the Hamming distance between items.
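To make the toy example concrete, here is a rough sketch of the comparison in Python. It is not the author's code: the comment doesn't say how many leading characters get dropped, so `PREFIX = 3` is my assumption, and `hamming` is just a helper name I made up. It counts differing positions between the remaining suffixes, which is Hamming distance on plain strings rather than on simhash fingerprints.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


words = ["aaarock", "aabjeep", "aaareep"]

# Assumption: drop the first three characters; the original comment
# does not specify the prefix length.
PREFIX = 3

# Hamming distance between the suffixes of every pair of words.
distances = {
    (a, b): hamming(a[PREFIX:], b[PREFIX:])
    for a, b in combinations(words, 2)
}

# The pair with the smallest distance is the "most similar" one.
closest = min(distances, key=distances.get)

for pair, dist in distances.items():
    print(pair, dist)
print("closest:", closest)
```

With a three-character prefix the suffixes are rock, jeep and reep, so the jeep/reep pair differs in only one position and comes out closest. A real simhash would compare fixed-width bit fingerprints instead of raw characters, but the distance step has the same shape.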