Show HN: All-pair similarity search on millions of sets in Python and on laptop

123 points by ekzhu, over 6 years ago

5 comments

mikece, over 6 years ago
I see the terms "python" and "search" and "millions of sets" and think data science... except that in the data science contexts with which I'm familiar we're looking at billions of records among petabytes of data. I know the article is about what can be done on a laptop but I'm left wondering if this is a neat small-scale proof of concept or something that scales and I need to research more when I've had coffee rather than bourbon.
waterhouse, over 6 years ago
Regarding the similarity function used: I haven't studied it myself, but read this a long time ago:

http://www.benfrederickson.com/distance-metrics/
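For readers following along: the thread does not say which similarity function the post uses, so the sketch below assumes Jaccard similarity, a common choice for comparing sets. The function names `jaccard` and `all_pairs_brute_force` are hypothetical, and the quadratic loop is only a baseline for illustration, not the method from the post.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def all_pairs_brute_force(sets, threshold):
    """Return (i, j, similarity) for every pair at or above the threshold.

    This compares every pair, so it needs O(n^2) similarity computations.
    It is a reference baseline, not a scalable algorithm.
    """
    results = []
    for i, j in combinations(range(len(sets)), 2):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results
```

A baseline like this is mainly useful for checking the output of a faster exact or approximate method on a small sample.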
a-dub, over 6 years ago
Which hash function are you using for the minhashes for the LSH benchmark? Example code from datasketch seems to indicate SHA-1. Is there good reason for that? Have you tried out murmur? I wonder if it improves runtime?
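To make the question concrete, here is a minimal MinHash sketch with a pluggable base hash. It is not datasketch's implementation. The names `sha1_hash32`, `crc32_hash32`, `minhash_signature` and `estimate_jaccard` are hypothetical, and CRC32 from the standard library stands in for a cheap non-cryptographic hash such as murmur, which would need a third-party package.

```python
import hashlib
import random
import zlib

# Mersenne prime used as the modulus of the (a*h + b) mod p hash family.
_PRIME = (1 << 61) - 1


def sha1_hash32(data: bytes) -> int:
    """32-bit value taken from the first 4 bytes of a SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")


def crc32_hash32(data: bytes) -> int:
    """32-bit CRC; a stand-in here for a fast non-cryptographic hash."""
    return zlib.crc32(data) & 0xFFFFFFFF


def minhash_signature(items, num_perm=128, hashfunc=sha1_hash32, seed=1):
    """MinHash signature of a non-empty collection of items.

    Each item is hashed once with `hashfunc`; the num_perm "permutations"
    are simulated by random affine maps over the base hash values.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
              for _ in range(num_perm)]
    base = [hashfunc(str(x).encode("utf-8")) for x in items]
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Because the base hash runs only once per item in this sketch, swapping it changes the per-item cost but not the cost of the permutation step. Whether that matters for overall runtime would have to be measured on the actual benchmark.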
moultano, over 6 years ago
If you were curious about the reference to MinHash in the OP, I just wrote a gentle guide to the MinHash family of algorithms (including our recent research extending it to probability distributions.) https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/
bra-ket, over 6 years ago
How does it compare to Spotify "annoy" LSH? https://github.com/spotify/annoy