科技回声

goldenkey, almost 3 years ago

Brilliant stuff. Isn't the XORing essentially just equivalent to a one-time pad, which isn't very hash-like? I'd think using a PRNG [1], with the initial hash value as the seed, to generate more values would be more effective.

[1] <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator" rel="nofollow">https://en.wikipedia.org/wiki/Pseudorandom_number_generator</a>
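To make the two options concrete, here is a minimal Python sketch. It assumes "the XORing" means XORing one base hash with k fixed random masks, and that the PRNG variant seeds a generator with that same base hash. The names `base_hash`, `xor_hashes`, `prng_hashes` and `MASKS` are hypothetical, and CRC32 only stands in for whatever hash is actually used.

```python
import random
import zlib


def base_hash(token: str) -> int:
    """32-bit base hash of a token (CRC32 is only a stand-in here)."""
    return zlib.crc32(token.encode("utf-8"))


def xor_hashes(token: str, masks: list[int]) -> list[int]:
    """Derive one value per mask by XORing the base hash with it."""
    h = base_hash(token)
    return [h ^ m for m in masks]


def prng_hashes(token: str, k: int) -> list[int]:
    """Derive k values by seeding a PRNG with the base hash."""
    rng = random.Random(base_hash(token))
    return [rng.getrandbits(32) for _ in range(k)]


# Fixed masks shared by every token (the seed 0 is arbitrary).
_mask_rng = random.Random(0)
MASKS = [_mask_rng.getrandbits(32) for _ in range(4)]

if __name__ == "__main__":
    for tok in ("foo", "bar"):
        x = xor_hashes(tok, MASKS)
        # For the XOR scheme, the difference between two derived values
        # is the same for every token: it only depends on the masks.
        print(tok, x[0] ^ x[1] == MASKS[0] ^ MASKS[1])
        print(tok, prng_hashes(tok, 4))
```

The XOR-derived values are tied together by a relation that does not depend on the token (`x[i] ^ x[j] == MASKS[i] ^ MASKS[j]`), which is the one-time-pad flavour described above. The PRNG-derived values have no such simple relation, at the cost of one generator setup per token.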


maest, almost 3 years ago

> The probability of C also having true in the row would be equal to Jaccard’s similarity!

That's clear; call this P1.

> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity

That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:

<pre><code>token  A      B
a      False  True
b      True   True
</code></pre>

This counts as matching under P1, but not under P2.

What am I missing here?

In other words, the cases where `reptoken(A) = reptoken(B)` are a subset of the cases where `reptoken(A) is in B`.
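One way to check the two quantities on this table is to enumerate every ordering of the tokens. The sketch below is a rough check, not the article's code: it assumes the min-hash is idealized as a uniformly random permutation of the token universe, and that `reptoken` picks the member of a document that comes first in that permutation. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def jaccard(a: set, b: set) -> Fraction:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|."""
    return Fraction(len(a & b), len(a | b))


def reptoken(doc: set, rank: dict):
    """Representative token: the member of doc with the smallest rank."""
    return min(doc, key=rank.__getitem__)


def match_probabilities(a: set, b: set):
    """Return (P[reptoken(A) == reptoken(B)], P[reptoken(A) in B]),
    computed exactly over all orderings of the tokens in A ∪ B."""
    universe = sorted(a | b)
    same = contained = total = 0
    for order in permutations(universe):
        rank = {tok: i for i, tok in enumerate(order)}
        ra, rb = reptoken(a, rank), reptoken(b, rank)
        same += ra == rb
        contained += ra in b
        total += 1
    return Fraction(same, total), Fraction(contained, total)


# The table from the comment: A contains only b, B contains a and b.
A = {"b"}
B = {"a", "b"}
p_same, p_contained = match_probabilities(A, B)

if __name__ == "__main__":
    print(p_same, p_contained, jaccard(A, B))  # 1/2 1 1/2
```

On this table, `reptoken(A) = reptoken(B)` holds for half of the orderings, which equals the Jaccard similarity of 1/2, while `reptoken(A) is in B` holds for all of them. Under these assumptions the subset observation is right, and it is the equality event rather than the containment event that tracks Jaccard similarity.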


sulam, almost 3 years ago

I understand the theory here, and it seems like it works, but it sure would have been nice to see some actual examples from a corpus of docs.

Similarity join (Min-hash)

3 comments
