Similarity join (Min-hash)

60 points by yellowflash, almost 3 years ago

3 comments

goldenkey, almost 3 years ago
Brilliant stuff. Isn't the XORing essentially just equivalent to a one-time pad, which isn't very hash-like? I'd think using a PRNG [1] with the initial hash value as the seed, to generate more values, would be more effective.

[1] https://en.wikipedia.org/wiki/Pseudorandom_number_generator
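
For anyone comparing the two ideas, here is a minimal sketch, not the article's code. It assumes the XOR scheme means XORing one base hash with k fixed random masks, and that the PRNG alternative means seeding a generator with the base hash and drawing k values from it. The names `base_hash`, `xor_family`, `prng_family`, and `signature` are hypothetical.

```python
import hashlib
import random


def base_hash(token: str) -> int:
    """64-bit base hash of a token (SHA-256 truncated; an arbitrary choice)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def xor_family(token: str, masks: list[int]) -> list[int]:
    """Assumed XOR scheme: one base hash, XORed with each fixed mask."""
    h = base_hash(token)
    return [h ^ m for m in masks]


def prng_family(token: str, k: int) -> list[int]:
    """Assumed PRNG scheme: seed a generator with the base hash, draw k values."""
    rng = random.Random(base_hash(token))
    return [rng.getrandbits(64) for _ in range(k)]


def signature(tokens: set[str], family) -> list[int]:
    """Min-hash signature: for each derived hash function, keep the minimum
    value seen over all tokens of the document."""
    columns = zip(*(family(t) for t in tokens))
    return [min(col) for col in columns]


# Hypothetical setup: 8 derived hash functions per scheme.
K = 8
MASKS = [random.Random(i).getrandbits(64) for i in range(K)]

doc = {"similarity", "join", "min", "hash"}
sig_xor = signature(doc, lambda t: xor_family(t, MASKS))
sig_prng = signature(doc, lambda t: prng_family(t, K))
```

Both schemes are deterministic per token, so either can be used to build comparable signatures; the sketch does not show which one gives better estimates.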
maest, almost 3 years ago
> The probability of C also having true in the row would be equal to Jaccard’s similarity!

That's clear, call this P1.

> The probability that two documents A and B having the same representative token is, equal again to Jaccard’s similarity

That's less clear (call this P2) and not equivalent to the first statement, afaict. In fact, this probability seems lower than the previous one. Consider the table:

    token  A      B
    a      False  True
    b      True   True

This counts as matching under P1, but not under P2.

What am I missing here?

In other words, the set of cases where `reptoken(A) = reptoken(B)` is a subset of the cases where `reptoken(A) is in B`.
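
One way to check P2 on the two-row table above is to enumerate every row ordering and count how often the two documents share a representative token. This is a sketch under the assumption that the representative token is the first token, in the shuffled row order, that a document contains; `rep_token`, `match_fraction`, and `jaccard` are hypothetical names.

```python
from itertools import permutations


def rep_token(doc: set[str], order: tuple[str, ...]) -> str:
    """First token in the given row order that the document contains."""
    return next(t for t in order if t in doc)


def match_fraction(a: set[str], b: set[str], tokens: list[str]) -> float:
    """Fraction of all row orderings in which a and b share a representative."""
    orders = list(permutations(tokens))
    hits = sum(rep_token(a, o) == rep_token(b, o) for o in orders)
    return hits / len(orders)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# The table from the comment: A contains only b, B contains a and b.
A = {"b"}
B = {"a", "b"}
p2 = match_fraction(A, B, ["a", "b"])
j = jaccard(A, B)
```

With the row order (a, b) the representatives differ (b for A, a for B); with (b, a) both are b. The single ordering shown in the table is the non-matching one, but averaged over both orderings the match fraction is 1/2, which is also the Jaccard similarity of A and B.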
sulam, almost 3 years ago
I understand the theory here, and it seems like it works, but it sure would have been nice seeing some actual examples from a corpus of docs.