Brilliant stuff. Isn't the XORing essentially just equivalent to a one-time pad, which isn't very hash-like? I'd think using a PRNG [1], seeded with the initial hash value, to generate the additional values would be more effective.<p>[1] <a href="https://en.wikipedia.org/wiki/Pseudorandom_number_generator" rel="nofollow">https://en.wikipedia.org/wiki/Pseudorandom_number_generator</a>
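<p>A minimal sketch of the two options, to make the comparison concrete. Everything here is an assumption for illustration: `base_hash` uses CRC32 as a stand-in for whatever hash the article actually uses, and `xor_hashes` / `prng_hashes` are hypothetical helper names, not the article's code.

```python
import random
import zlib


def base_hash(token: str) -> int:
    # Stand-in base hash (assumption): CRC32 of the UTF-8 bytes, a 32-bit value.
    return zlib.crc32(token.encode("utf-8"))


def xor_hashes(token: str, masks: list[int]) -> list[int]:
    # Option 1: derive extra values by XORing one base hash with fixed masks.
    h = base_hash(token)
    return [h ^ m for m in masks]


def prng_hashes(token: str, k: int) -> list[int]:
    # Option 2: seed a PRNG with the base hash and draw k further 32-bit values.
    rng = random.Random(base_hash(token))
    return [rng.getrandbits(32) for _ in range(k)]


# Fixed masks shared by all tokens, drawn once from a seeded generator.
mask_rng = random.Random(0)
masks = [mask_rng.getrandbits(32) for _ in range(4)]

print(xor_hashes("foo", masks))
print(prng_hashes("foo", 4))
```

Both variants are deterministic per token, which is what the signature needs. The difference is that each XOR value is the base hash with the same bits flipped for every token, while the PRNG values depend on the whole seed.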
> The probability of C also having true in the row would be equal to Jaccard’s similarity!<p>That's clear; call this P1.<p>> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity<p>That's less clear (call this P2) and, as far as I can tell, not equivalent to the first statement. In fact, this probability seems lower than the previous one. Consider the table:<p><pre><code>  token  A      B
  a      False  True
  b      True   True
</code></pre>
This counts as matching under P1, but not under P2.<p>What am I missing here?<p>In other words, the set of cases where `reptoken(A) = reptoken(B)` is a subset of the cases where `reptoken(A) is in B`.
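<p>For what it's worth, here is a quick simulation of the two readings on the table above. It assumes the representative token of a document is its first token under one random shuffle of the rows that both documents share; that is my reading of the article, not something it confirms. `rep_token` and `estimate` are hypothetical names.

```python
import random


def rep_token(doc: set[str], rank: dict[str, int]) -> str:
    # Representative token: the token of doc that comes first in the shuffled order.
    return min(doc, key=rank.__getitem__)


def estimate(A: set[str], B: set[str], trials: int = 20000, seed: int = 0):
    rng = random.Random(seed)
    universe = sorted(A | B)
    same = in_b = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)  # one shared random ordering of the rows
        rank = {tok: i for i, tok in enumerate(order)}
        ra, rb = rep_token(A, rank), rep_token(B, rank)
        same += ra == rb   # event behind P2: reptoken(A) = reptoken(B)
        in_b += ra in B    # the weaker event: reptoken(A) is in B
    return same / trials, in_b / trials


A = {"b"}       # column A of the table
B = {"a", "b"}  # column B of the table
jaccard = len(A & B) / len(A | B)
p_same, p_in_b = estimate(A, B)
print(jaccard, p_same, p_in_b)
```

On this table the weaker event happens on every trial, while the frequency of `reptoken(A) = reptoken(B)` should land close to the Jaccard similarity of 1/2. So the subset relation holds, and it looks like the Jaccard claim is about the equality event under a shared shuffle rather than about membership.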