37 points by kkm about 4 years ago

2 comments

I wrote something similar several years ago on minhashing for near-duplicate detection.

https://medium.com/@jonathankoren/near-duplicate-detection-b6694e807f7a

Nzen about 4 years ago

Simhashing is a way of characterizing the similarity of data. The author begins with the idea that we can discard the first characters of { aaarock, aabjeep, aaareep } so that the latter two come out as most similar, and concludes by computing the Hamming distance of the data.
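For anyone who wants to see that last step concretely, here is a rough Python sketch of a simhash-style fingerprint and the Hamming distance between two fingerprints. It is not the article's code: the function names `simhash` and `hamming_distance`, the use of MD5 as the per-token hash, the 64-bit width, and the whitespace tokenization are all assumptions made for illustration.

```python
import hashlib


def simhash(tokens, bits=64):
    """Build a simhash-style fingerprint (illustrative sketch, not the article's code)."""
    counts = [0] * bits
    for tok in tokens:
        # Assumption: take the first bits/8 bytes of MD5 as the per-token hash.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two integers differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()

print(hamming_distance(0b1011, 0b1001))  # 1
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

The idea is that documents sharing most of their tokens should tend to land on fingerprints that differ in only a few bits, so a small Hamming distance serves as a cheap signal of near-duplication.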

Simhashing (Hopefully) Made Simple (2012)
