Show HN: All-pair similarity search on millions of sets in Python and on laptop

123 points by ekzhu over 6 years ago

5 comments

mikece over 6 years ago
I see the terms "python" and "search" and "millions of sets" and think data science... except that in the data science contexts with which I'm familiar we're looking at billions of records among petabytes of data. I know the article is about what can be done on a laptop but I'm left wondering if this is a neat small-scale proof of concept or something that scales and I need to research more when I've had coffee rather than bourbon.
waterhouse over 6 years ago
Regarding the similarity function used: I haven't studied it myself, but read this a long time ago:

http://www.benfrederickson.com/distance-metrics/
a-dub over 6 years ago
Which hash function are you using for the minhashes in the LSH benchmark? Example code from datasketch seems to indicate SHA-1. Is there a good reason for that? Have you tried murmur? I wonder whether it would improve runtime.
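For anyone following the hash-function question above, here is a minimal sketch of MinHash with a swappable base hash. It is not datasketch's implementation; every name in it (`sha1_hash32`, `crc32_hash32`, `make_permutations`, `minhash_signature`, `estimate_jaccard`) is hypothetical. CRC32 from the standard library is used only as a stand-in for a cheaper non-cryptographic hash; trying MurmurHash would need a third-party package, which this sketch avoids.

```python
import hashlib
import random
import zlib

_MERSENNE_PRIME = (1 << 61) - 1  # modulus for the linear "permutations"
_MAX_HASH = (1 << 32) - 1        # keep hash values in 32 bits


def sha1_hash32(data: bytes) -> int:
    """Base hash option 1: first 4 bytes of the SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "little")


def crc32_hash32(data: bytes) -> int:
    """Base hash option 2: CRC32, a stand-in for a faster non-crypto hash."""
    return zlib.crc32(data) & _MAX_HASH


def make_permutations(num_perm: int, seed: int = 1):
    """Draw (a, b) pairs for the hash family h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, _MERSENNE_PRIME), rng.randrange(0, _MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash_signature(items, perms, hashfunc):
    """Return one minimum per permutation over all items in the set."""
    sig = [_MAX_HASH] * len(perms)
    for item in items:
        h = hashfunc(item.encode("utf-8"))
        for i, (a, b) in enumerate(perms):
            v = ((a * h + b) % _MERSENNE_PRIME) & _MAX_HASH
            if v < sig[i]:
                sig[i] = v
    return sig


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions where both sets share the same minimum."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {f"tok{i}" for i in range(0, 150)}
    b = {f"tok{i}" for i in range(50, 200)}
    perms = make_permutations(256)
    print("exact Jaccard:", len(a & b) / len(a | b))
    for name, fn in [("sha1", sha1_hash32), ("crc32", crc32_hash32)]:
        est = estimate_jaccard(
            minhash_signature(a, perms, fn), minhash_signature(b, perms, fn)
        )
        print(name, "estimate:", est)
```

In this sketch the base hash runs once per element, while the permutation loop runs `num_perm` times per element, so how much a faster base hash helps depends on where the time actually goes. Timing both variants on real data would be the way to find out.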
moultano over 6 years ago
If you were curious about the reference to MinHash in the OP, I just wrote a gentle guide to the MinHash family of algorithms (including our recent research extending it to probability distributions). https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/
bra-ket over 6 years ago
How does it compare to Spotify's "annoy" LSH? https://github.com/spotify/annoy