TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Similarity join (Min-hash)

60 pointsby yellowflashalmost 3 years ago

3 comments

goldenkeyalmost 3 years ago
Brilliant stuff. Isn&#x27;t the XORing essentially just equivalent to a 1-time pad -- which isn&#x27;t very hash-like. I&#x27;d think using a PRNG [1] with the initial hash value as the seed, to generate more values, would be more effective.<p><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pseudorandom_number_generator" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pseudorandom_number_generator</a>
评论 #31888851 未加载
评论 #31898297 未加载
maestalmost 3 years ago
&gt; The probability of C also having true in the row would be equal to Jaccard’s similarity!<p>That&#x27;s clear, call this P1<p>&gt; The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity<p>That&#x27;s less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:<p><pre><code> token A B a False True b True True </code></pre> This counts as matching under P1, but not under P2.<p>What am I missing here?<p>In order words, the number of cases where `reptoken(A) = reptoken(B)` is a subset of cases where `reptoken(A) is in B`
评论 #31975354 未加载
评论 #31898209 未加载
sulamalmost 3 years ago
I understand the theory here, and it seems like it works, but it sure would have been nice seeing some actual examples from a corpus of docs.