I see the terms "python" and "search" and "millions of sets" and think data science... except that in the data science contexts with which I'm familiar, we're looking at billions of records among petabytes of data. I know the article is about what can be done on a laptop, but I'm left wondering whether this is a neat small-scale proof of concept or something that scales and that I need to research further, once I've had coffee rather than bourbon.
Regarding the similarity function used: I haven't studied it myself, but I read this a long time ago: http://www.benfrederickson.com/distance-metrics/
Which hash function are you using for the minhashes for the LSH benchmark? Example code from datasketch seems to indicate SHA-1. Is there good reason for that? Have you tried out murmur? I wonder if it improves runtime?
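For a rough sense of the raw hash cost in isolation (not the article's benchmark), here is a minimal micro-benchmark sketch comparing SHA-1 to MurmurHash3 as the per-token hash. It assumes the third-party mmh3 package; and if I remember right, recent datasketch versions also let you pass a custom hash function in, so swapping it would mostly be a matter of plumbing.

```python
# Rough micro-benchmark: SHA-1 vs MurmurHash3 as the token hash for MinHash.
# Not the article's benchmark; requires the third-party mmh3 package.
import hashlib
import struct
import time

import mmh3  # pip install mmh3

tokens = [f"token-{i}".encode("utf8") for i in range(1_000_000)]

def sha1_32(data):
    # First 4 bytes of the SHA-1 digest, as an unsigned 32-bit int
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]

def murmur_32(data):
    # MurmurHash3 returns a signed 32-bit int; mask to get an unsigned value
    return mmh3.hash(data) & 0xFFFFFFFF

for name, fn in [("sha1", sha1_32), ("murmur3", murmur_32)]:
    start = time.perf_counter()
    for t in tokens:
        fn(t)
    print(f"{name}: {time.perf_counter() - start:.2f}s for {len(tokens)} tokens")
```

Whether that difference shows up end to end depends on how much of the runtime is hashing versus the LSH index itself, of course.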
If you were curious about the reference to MinHash in the OP, I just wrote a gentle guide to the MinHash family of algorithms (including our recent research extending it to probability distributions.)
https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/
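For anyone who wants the one-screen version of the core idea before reading the guide: a hand-rolled sketch of plain MinHash (not the datasketch implementation), estimating Jaccard similarity from the fraction of hash functions whose minimum value agrees across the two sets.

```python
# Minimal MinHash: estimate Jaccard similarity of two sets from the
# fraction of "hash functions" whose minimum hash value agrees.
import random

def minhash_signature(items, seeds):
    # One simulated hash function per seed, via Python's hash of (seed, item).
    # (Consistent within a single process, which is all this demo needs.)
    return [min(hash((seed, item)) for item in items) for seed in seeds]

def estimate_jaccard(sig_a, sig_b):
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(256)]

a = {f"user{i}" for i in range(100)}
b = {f"user{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ≈ 0.33

sig_a = minhash_signature(a, seeds)
sig_b = minhash_signature(b, seeds)
print(estimate_jaccard(sig_a, sig_b))  # roughly 0.33
```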
How does it compare to Spotify's "annoy" LSH? https://github.com/spotify/annoy
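Worth noting for the comparison: Annoy builds a forest of random-projection trees over dense vectors for approximate nearest-neighbour search (angular or Euclidean distance), rather than hashing sets, so it's solving a somewhat different problem than MinHash LSH over Jaccard similarity. A minimal usage sketch, assuming the annoy Python package:

```python
# Minimal Annoy usage: approximate nearest neighbours over dense vectors.
# pip install annoy
import random

from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

random.seed(0)
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)  # number of trees: more trees -> better recall, bigger index
print(index.get_nns_by_item(0, 10))  # 10 approximate neighbours of item 0
```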