I see the terms "python" and "search" and "millions of sets" and think data science... except that in the data science contexts with which I'm familiar, we're looking at billions of records among petabytes of data. I know the article is about what can be done on a laptop, but I'm left wondering whether this is a neat small-scale proof of concept or something that scales and that I need to research further, once I've had coffee rather than bourbon.
Regarding the similarity function used: I haven't studied it myself, but I read this a long time ago: http://www.benfrederickson.com/distance-metrics/
Which hash function are you using for the minhashes for the LSH benchmark? Example code from datasketch seems to indicate SHA-1. Is there good reason for that? Have you tried out murmur? I wonder if it improves runtime?
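For a rough sense of the raw hash cost in isolation (not the article's benchmark), here is a minimal micro-benchmark sketch comparing SHA-1 to MurmurHash3 as the per-token hash. It assumes the third-party mmh3 package; and if I remember right, recent datasketch versions also let you pass a custom hash function in, so swapping it would mostly be a matter of plumbing.

```python
# Rough micro-benchmark: SHA-1 vs MurmurHash3 as the token hash for MinHash.
# Not the article's benchmark; requires the third-party mmh3 package.
import hashlib
import struct
import time

import mmh3  # pip install mmh3

tokens = [f"token-{i}".encode("utf8") for i in range(1_000_000)]

def sha1_32(data):
    # First 4 bytes of the SHA-1 digest, as an unsigned 32-bit int
    return struct.unpack("<I", hashlib.sha1(data).digest()[:4])[0]

def murmur_32(data):
    # MurmurHash3 returns a signed 32-bit int; mask to get an unsigned value
    return mmh3.hash(data) & 0xFFFFFFFF

for name, fn in [("sha1", sha1_32), ("murmur3", murmur_32)]:
    start = time.perf_counter()
    for t in tokens:
        fn(t)
    print(f"{name}: {time.perf_counter() - start:.2f}s for {len(tokens)} tokens")
```

Whether that difference shows up end to end depends on how much of the runtime is hashing versus the LSH index itself, of course.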
If you were curious about the reference to MinHash in the OP, I just wrote a gentle guide to the MinHash family of algorithms (including our recent research extending it to probability distributions.)
https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsxyg4467-6/
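For anyone who wants the one-screen version of the core idea before reading the guide: a hand-rolled sketch of plain MinHash (not the datasketch implementation), estimating Jaccard similarity from the fraction of hash functions whose minimum value agrees across the two sets.

```python
# Minimal MinHash: estimate Jaccard similarity of two sets from the
# fraction of "hash functions" whose minimum hash value agrees.
import random

def minhash_signature(items, seeds):
    # One simulated hash function per seed, via Python's hash of (seed, item).
    # (Consistent within a single process, which is all this demo needs.)
    return [min(hash((seed, item)) for item in items) for seed in seeds]

def estimate_jaccard(sig_a, sig_b):
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(256)]

a = {f"user{i}" for i in range(100)}
b = {f"user{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ≈ 0.33

sig_a = minhash_signature(a, seeds)
sig_b = minhash_signature(b, seeds)
print(estimate_jaccard(sig_a, sig_b))  # roughly 0.33
```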
How does it compare to Spotify's "annoy" LSH? https://github.com/spotify/annoy
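Worth noting for the comparison: Annoy builds a forest of random-projection trees over dense vectors for approximate nearest-neighbour search (angular or Euclidean distance), rather than hashing sets, so it's solving a somewhat different problem than MinHash LSH over Jaccard similarity. A minimal usage sketch, assuming the annoy Python package:

```python
# Minimal Annoy usage: approximate nearest neighbours over dense vectors.
# pip install annoy
import random

from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity

random.seed(0)
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)  # number of trees: more trees -> better recall, bigger index
print(index.get_nns_by_item(0, 10))  # 10 approximate neighbours of item 0
```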