
Finding near-duplicates with Jaccard similarity and MinHash

247 points · by brianyu8 · 11 months ago

14 comments

tpoacher · 11 months ago
One thing many people miss about set-based metrics like the Jaccard similarity (aka Tanimoto coefficient) and the F1 score (aka Dice coefficient) is that they can also be used identically with fuzzy sets.

The only complication is that you then need to choose a suitable T-norm / T-conorm pair, which express the concepts of intersection and union for fuzzy sets, and there's an infinite family of them. But that's a good thing, since you get to pick the pair with your desired semantics.

I wrote about this ([0][1]) in the context of validating medical image segmentations, when both the segmentation and the ground truth are probabilistic/fuzzy rather than binary masks.

Otherwise, what most people do instead is simply threshold at 0.5 to obtain binary sets for use with the binary variants of the Jaccard / Dice coefficients. Which apparently decreases the precision of your *validation* operator by two orders of magnitude. It's like, you publish your algorithm claiming it's better than SOTA by 0.001, ignoring the fact that your validation operator has an error margin of 0.1 ...

[0] https://link.springer.com/chapter/10.1007/978-3-319-46723-8_42

[1] https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-088ea28806c5
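A minimal sketch of the fuzzy variant described above, using the min/max (Gödel) T-norm/T-conorm pair; the membership vectors and threshold here are purely illustrative:

```python
import numpy as np

def fuzzy_jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Fuzzy Jaccard with the min/max (Goedel) T-norm/T-conorm pair.

    a and b are membership vectors in [0, 1] over the same universe.
    """
    intersection = np.minimum(a, b).sum()  # T-norm: fuzzy intersection
    union = np.maximum(a, b).sum()         # T-conorm: fuzzy union
    return intersection / union if union > 0 else 1.0

# Two probabilistic "masks" over the same five elements.
a = np.array([0.9, 0.8, 0.1, 0.0, 0.7])
b = np.array([0.8, 0.6, 0.3, 0.0, 0.9])

print(fuzzy_jaccard(a, b))                      # fuzzy score
print(fuzzy_jaccard((a > 0.5).astype(float),    # the lossy 0.5-threshold variant
                    (b > 0.5).astype(float)))
```

Swapping in a different pair, say the product T-norm `a*b` with the probabilistic sum `a + b - a*b`, keeps the same code shape but changes the semantics, which is the choice the comment is pointing at.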
BiteCode_dev · 11 months ago
I worked with a client that implemented their own Python version of this to deduplicate citizen entries in a big French government database. It worked well.

Of course, nowadays I would probably just tell them to use datasketch (https://pypi.org/project/datasketch/).

With this trip down memory lane, I looked around a little and noticed people are still creating new stuff on the topic. E.g.:

https://pypi.org/project/rensa/

Which is basically a more specialized but faster version of datasketch's MinHash, written in Rust, with a little Python on top.
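For readers who haven't used datasketch, a small sketch of what such a deduplication looks like with its MinHash/MinHashLSH classes; the records, threshold, and tokenization are made-up placeholders:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():  # word tokens; char n-grams also work
        m.update(token.encode("utf8"))
    return m

records = {
    "rec1": "Jean Dupont 12 rue de la Paix Paris",
    "rec2": "Jean Dupont 12 rue de la paix, Paris",
    "rec3": "Marie Curie 1 avenue des Sciences Lyon",
}

# LSH index: returns candidate matches above the Jaccard threshold
# without comparing every pair.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash_of(text) for key, text in records.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

print(lsh.query(signatures["rec1"]))  # likely ['rec1', 'rec2']
```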
pkeenan11 · 11 months ago
Holy synchronicity, Batman! I just implemented a minhash system that some may find interesting.

The problem is to find (pseudo)inverses of many different proper submatrices of a big square matrix.

Certain matrix identities (Woodbury, Banachiewicz) allow you to update the inverse of a "close" submatrix to cheaply calculate a new one.

So you store the inverses you've calculated already, with their row/column indices as keys. Then for each new submatrix you want to find a close, already-computed inverse to update from.

This is solved with minhashing. You minwise-hash the indices so that close matrices are likely to have the same hash.

In my particular implementation I did a "multi-resolution" hash so that I can change the selectiveness of the search as the number of prior computed inverses grows.
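A hypothetical reconstruction of that idea, not the commenter's code: minwise-hash each submatrix's index set, and treat progressively shorter signature prefixes as progressively looser buckets, which gives the "multi-resolution" knob:

```python
import hashlib

def minhash_signature(indices, num_hashes=16):
    """Minwise-hash a set of row/column indices with num_hashes salted hashes."""
    return tuple(
        min(hashlib.blake2b(f"{salt}:{i}".encode(), digest_size=8).digest()
            for i in indices)
        for salt in range(num_hashes)
    )

store = {}  # signature prefix -> index sets whose inverse is already cached

def register(indices):
    sig = minhash_signature(indices)
    for k in range(len(sig), 0, -1):       # file under every resolution
        store.setdefault(sig[:k], set()).add(frozenset(indices))

def find_close(indices):
    sig = minhash_signature(indices)
    for k in range(len(sig), 0, -1):       # strictest bucket first, then relax
        if sig[:k] in store:
            return store[sig[:k]]
    return set()

register({1, 2, 3, 4, 5})
print(find_close({1, 2, 3, 4, 6}))         # likely finds the close neighbor
```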
a-dub · 11 months ago
a little backstory (that seems to be missing from the blog post): this is a technique that was invented in the early days of google for deduplicating the crawl set (turns out that building an llm and building a plain ol' web text index look awfully similar; i wonder why that is?!). you can read about it in depth in jeffrey ullman's free book "mining massive datasets", which describes many of the really cool and impressive techniques that were used to prepare an index build for the entire internet when such a thing was really hard.

you can find the relevant material for free by searching "chapter 3 pdf mmds ullman"

enjoy!

edit: oh no! i'm wrong! according to wikipedia it was invented at dec for altavista: https://en.wikipedia.org/wiki/MinHash. either way there's a nice description in the ullman book, and they do describe how it was used at google as well.
carnewal · 11 months ago
As I got into MinHash & its variants I found it hard to wrap my head around this, so I'm building an online visualizer:

https://websla.sh/tools/minhash

It's not really complete; I'd like to show Jaccard similarity calculations, among other things, but this already allows you to enter multiple strings and see with your own eyes what a "minhash" actually is.
ashvardanian · 11 months ago
Hashing, or tiny neural nets combined with a vector search engine using Tanimoto/Jaccard distance, is a very common deduplication strategy for large datasets. It might be wiser than using linear-complexity MapReduce operations.

There is a nice Google project using a 0.5M-parameter RETSim model and the USearch engine for that: https://github.com/google/unisim
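Not the RETSim/USearch pipeline itself, just a toy illustration of the idea: binary fingerprints compared by Tanimoto, with a brute-force scan standing in for the vector search engine; all data here is synthetic:

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto/Jaccard similarity over binary fingerprint vectors."""
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 1.0

rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(10_000, 256), dtype=np.uint8)

query = fingerprints[42].copy()
query[:8] ^= 1  # a slightly perturbed copy of item 42

# A real engine replaces this O(n) scan with an approximate index.
scores = np.array([tanimoto(query, fp) for fp in fingerprints])
print(scores.argmax())  # 42: the near-duplicate wins
```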
vivzkestrel · 11 months ago
I have this problem right now in Postgres. I have 600,000 feed_items with the schema (feed_item_id uuid, author varchar, content text, guid varchar, link varchar, title varchar, summary text, feed_id integer). The content and summary columns, especially for some of the news items, are highly similar but not equal. For any given two such news items, I am trying to cut them down to one. Any ideas?
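One possibility, sketched under the assumption that the article's MinHash/LSH approach fits this data: stream the rows out of Postgres, shingle title plus content, and flag near-duplicates with an LSH index. The connection string, threshold, and shingle size below are placeholders:

```python
import psycopg2
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.85, num_perm=NUM_PERM)

def signature(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    words = text.lower().split()
    for i in range(max(len(words) - 2, 1)):   # 3-word shingles
        m.update(" ".join(words[i:i + 3]).encode("utf8"))
    return m

conn = psycopg2.connect("dbname=mydb")        # hypothetical DSN
with conn.cursor() as cur:
    cur.execute("SELECT feed_item_id, title, content FROM feed_items")
    for feed_item_id, title, content in cur:
        sig = signature(f"{title or ''} {content or ''}")
        dupes = lsh.query(sig)                # near-duplicates seen so far
        if dupes:
            print(feed_item_id, "duplicates", dupes[0])
        else:
            lsh.insert(str(feed_item_id), sig)
```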
ryantwolf · 10 months ago
Love the article! My team at NVIDIA recently released a GPU-accelerated version of the fuzzy deduplication algorithm described, and I figure this community might be interested.

Here's the repo: https://github.com/NVIDIA/NeMo-Curator/

Some documentation on the fuzzy dedup scripts: https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/gpudeduplication.html

And a Python example: https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py

Would be interested in any feedback from the folks here.
dekhn · 10 months ago
This is one of those techniques I don't understand when I read it as text, but will instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the working process.

I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimization would work better.
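In that spirit, a minimal end-to-end example one can step through: k salted hash functions, a signature made of per-function minima, and the fraction of matching signature slots as the Jaccard estimate. Pure Python, with made-up documents:

```python
import random
import zlib

P = (1 << 61) - 1  # Mersenne prime, for a universal-style hash family

def make_hashers(k, seed=0):
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(k)]

def signature(tokens, hashers):
    hs = [zlib.crc32(t.encode("utf8")) for t in tokens]
    # One minimum per hash function: the MinHash signature.
    return [min((a * h + b) % P for h in hs) for a, b in hashers]

def estimated_jaccard(sig1, sig2):
    # Fraction of slots where the two minima agree.
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def exact_jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown fox leaps over a lazy dog".split())

hashers = make_hashers(256)
print(exact_jaccard(doc1, doc2))                      # 7/10 = 0.7
print(estimated_jaccard(signature(doc1, hashers),
                        signature(doc2, hashers)))    # close to 0.7
```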
derefr · 11 months ago
As a document clustering / dataset deduplication technique, how well does "throwing ML at the problem" (e.g. using a pre-trained LLM encoder to generate vector embeddings of your documents, then sticking those vectors into a vector DB and doing k-means to cluster the vectors) compare, quality-wise and performance-wise, to these simpler discrete-algorithm methods?
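For concreteness, a minimal sketch of the ML route being asked about; the model name, cluster count, and toy documents are arbitrary assumptions, with sentence-transformers and scikit-learn doing the heavy lifting:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "MinHash approximates Jaccard similarity between sets.",
    "Jaccard similarity between sets can be approximated with MinHash.",
    "The weather in Paris is lovely in spring.",
]

# Any sentence encoder would do; this is just a common small default.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init="auto").fit_predict(embeddings)
print(labels)  # the two paraphrases should share a cluster
```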
ColinWright · 10 months ago
In case the author is following this:

Typo:

"It's worth noticing that this definition is not in general transitive: we may well have three documents A, B, C such that S(A,B) ≥ S_{crit} and S(B,C) ≥ S_{crit} but S(A,B) < S_{crit}"

That last S(A,B) should be S(A,C).

*(Unless I'm very much mistaken ... and I could be.)*
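A concrete instance of the non-transitivity being pointed at, as a quick check:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

A, B, C = {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}
S_crit = 0.5
print(jaccard(A, B), jaccard(B, C), jaccard(A, C))  # 0.6, 0.6, 0.333...
# A~B and B~C clear the threshold, but A~C does not:
# similarity at a threshold is not transitive.
```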
is_true · 11 months ago
I had to do this with sports teams, but I used Levenshtein distance for the names. I ended up creating a vector with other data (country, gender) and using that vector to calculate the distance (similarity).
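A toy sketch of that kind of blended distance; the field names and weights are made up:

```python
def levenshtein(s: str, t: str) -> int:
    # Classic two-row dynamic-programming edit distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def team_distance(a, b, name_weight=0.7):
    """Blend normalized name edit distance with categorical mismatches."""
    name = levenshtein(a["name"].lower(), b["name"].lower())
    name /= max(len(a["name"]), len(b["name"]), 1)
    categorical = (a["country"] != b["country"]) + (a["gender"] != b["gender"])
    return name_weight * name + (1 - name_weight) * categorical / 2

x = {"name": "FC Barcelona", "country": "ES", "gender": "M"}
y = {"name": "F.C. Barcelona", "country": "ES", "gender": "M"}
print(team_distance(x, y))  # small: likely the same team
```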
motohagiography · 11 months ago
there was a really simple string sort-by-similarity one-liner I can't find in my history, but the basic idea was to take a wordlist, split each word in half, swap the halves, concatenate them back together, sort the resulting strings, then flip the strings back to their original words.

this simple shuffle was a poor man's similarity sort that I used to find sound-alike words. the thing about the quality of similarity is that it's hard to know for sure if it worked, but this was sufficient.
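A rough reconstruction of that shuffle, padded out from a one-liner for readability; this is a guess at the described trick, not the original:

```python
def half_swap(word: str) -> str:
    mid = len(word) // 2
    return word[mid:] + word[:mid]   # move the back half to the front

def similarity_sort(words):
    # Sort on the swapped form, then recover the original word, so that
    # words sharing endings (often sound-alikes) end up adjacent.
    return [w for _, w in sorted((half_swap(w), w) for w in words)]

print(similarity_sort(["nation", "station", "ration", "random", "ransom"]))
# ['random', 'nation', 'ration', 'ransom', 'station']
```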
ptspts · 10 months ago
Do the minhash-based algorithms in the article give exactly the same results (but faster) as pairwise calculation of J(a, b), or are the minhash-based results approximate?
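They are approximate: each of the k hash functions contributes one "do the minima match" coin flip with success probability J(a, b), so the estimate is unbiased with a standard error on the order of sqrt(J(1-J)/k). A quick simulation, with salted hashing standing in for random permutations:

```python
import random

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_estimate(a, b, k, seed=0):
    rnd = random.Random(seed)
    matches = 0
    for _ in range(k):
        salt = rnd.random()
        h = lambda x: hash((salt, x))   # one salted "permutation"
        matches += min(a, key=h) == min(b, key=h)
    return matches / k

A = set(range(0, 80))
B = set(range(40, 120))
print(exact_jaccard(A, B))              # 40/120 = 1/3 exactly
for k in (16, 64, 256, 1024):
    print(k, minhash_estimate(A, B, k)) # converges toward 1/3 as k grows
```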