科技回声 (Tech Echo) — a tech news platform built with Next.js, providing global tech news and discussion.

Jaccard Index

252 points · by dedalus · about 2 years ago

19 comments

pncnmnp · about 2 years ago
I recently wrote a fun blog post (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to estimate Jaccard Similarity using min hashing, what b-bit min hashing is, and how to improve upon its limitations using a 2014 data structure called odd sketches.

Jaccard Similarity's history is also quite interesting. From my blog:

> In the late 19th century, the United States and several European nations were focused on developing strategies for weather forecasting, particularly for storm warnings. In 1884, Sergeant John Finley of the U.S. Army Signal Corps conducted experiments aimed at creating a tornado forecasting program for 18 regions in the United States east of the Rockies. To the surprise of many, Finley claimed his programs were 95.6% to 98.6% accurate, with some areas even achieving a 100% accuracy rate. Upon publishing his findings, Finley's methods were criticized by contemporaries who pointed out flaws in his verification strategies and proposed their solutions. This sparked a renewed interest in weather prediction, which is now referred to as the "Finley Affair."

> One of these contemporaries was Grove Karl Gilbert. Just two months after Finley's publication, Gilbert pointed out that, based on Finley's strategy, a 98.2% accuracy rate could be achieved simply by forecasting no tornado warning. Gilbert then introduced an alternative strategy, which is now known as Jaccard Similarity.

> So why is it named Jaccard Similarity? As it turns out, nearly three decades after Sergeant John Finley's tornado forecasting program in the 1880s, Paul Jaccard independently developed the same concept while studying the distribution of alpine flora.
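The min-hash estimate mentioned above can be sketched in a few lines. This is a toy illustration (not the blog post's implementation); it approximates random permutations with salted applications of Python's built-in `hash`, which is fine for a demo but not a proper hash family:

```python
def minhash_signature(items, seeds):
    # One "permutation" per seed, approximated by salting the hash;
    # the signature keeps the minimum hash value seen for each seed.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

def estimate_jaccard(a, b, num_hashes=256):
    # The probability that two sets share a min-hash equals their true
    # Jaccard similarity, so the fraction of matching signature slots
    # is an unbiased estimate of it.
    seeds = range(num_hashes)
    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_hashes

a = set(range(100))
b = set(range(50, 150))  # true Jaccard = 50 / 150 = 1/3
print(estimate_jaccard(a, b))
```

With 256 hash functions the estimate typically lands within a few percent of the true value; more hashes tighten it at linear cost, which is what b-bit min hashing and odd sketches then compress further.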
BenoitP · about 2 years ago
As an aside, if you find yourself having to compute them on the fly, know that the Roaring Bitmaps libraries are the way to go [1]. The bitmaps are compressed and can be streamed directly into SIMD computations (batching boolean transformations and popcnts 256 bits wide!). The Jaccard index is just intersection_len / union_len [2] away.

Of note: the author of that library is none other than Daniel Lemire [3], whose articles pop up quite often on HN.

[1] https://roaringbitmap.org/

[2] https://roaringbitmap.readthedocs.io/en/latest/#roaringbitmap.RoaringBitmap.intersection_len

[3] https://lemire.me/blog/
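The intersection_len / union_len recipe can be illustrated with plain Python integers standing in for (uncompressed) bitsets — a sketch of the idea only; the real Roaring libraries do the same popcount arithmetic over compressed containers with SIMD:

```python
def to_bitmap(elements):
    # Pack a set of small non-negative ints into one big int bitset.
    bm = 0
    for e in elements:
        bm |= 1 << e
    return bm

def bitmap_jaccard(a, b):
    # Jaccard = popcount(a & b) / popcount(a | b).
    union = a | b
    if union == 0:
        return 1.0  # convention: two empty sets count as identical
    return bin(a & b).count("1") / bin(union).count("1")

a = to_bitmap({1, 2, 3, 4})
b = to_bitmap({3, 4, 5, 6})
print(bitmap_jaccard(a, b))  # 2 shared / 6 total elements -> 1/3
```

`bin(x).count("1")` is the portable popcount; on Python 3.10+ `x.bit_count()` does the same thing faster.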
psyklic · about 2 years ago
This is one of my favorite distance metrics* to show people!

For example, perhaps one person likes Reddit and HN, while someone else likes HN and SO. Then their Jaccard Index would be 1/3, since they have one thing in common out of three.

* Technically it computes "similarity" (larger number == more similar), but `1 - Jaccard Index` is a distance (smaller number == more similar).
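That example works out directly with Python sets (the site names are just illustrative labels):

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

alice = {"reddit", "hn"}
bob = {"hn", "stackoverflow"}

print(jaccard(alice, bob))      # 1 shared site out of 3 distinct -> 1/3
print(1 - jaccard(alice, bob))  # the corresponding Jaccard distance
```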
stygiansonic · about 2 years ago
The name may be an example of this: https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy

"It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v)[1] and now is frequently referred to as the Critical Success Index in meteorology.[2] It was later developed independently by Paul Jaccard…"
charliejuggler · about 2 years ago
My colleague Nate wrote about using Jaccard for search engine regression testing: https://opensourceconnections.com/blog/2021/07/23/jaccard-index-for-search-regression-testing/ If you make changes to a search algorithm and the order of results changes too much, this can alarm users, so this is a good way to reduce risk.
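A sketch of that idea — comparing the top-k result sets of a query before and after an algorithm change, with made-up document ids and a threshold that is purely an assumption, not anything from the linked post:

```python
def topk_jaccard(old_results, new_results, k=10):
    # Compare the *sets* of ids in each ranking's top k; order within
    # the top k is deliberately ignored, only membership churn counts.
    old_top, new_top = set(old_results[:k]), set(new_results[:k])
    return len(old_top & new_top) / len(old_top | new_top)

# Hypothetical before/after rankings for one query.
before = ["doc1", "doc2", "doc3", "doc4"]
after = ["doc2", "doc1", "doc5", "doc4"]

score = topk_jaccard(before, after)
if score < 0.5:  # alert threshold is a tunable assumption
    print(f"results drifted: Jaccard {score:.2f}")
```

Here doc1, doc2, and doc4 survive out of five distinct documents, so the score is 3/5 and no alert fires.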
pfdietz · about 2 years ago
I've used this to find similarity between terms in a specification document and identifiers in a program. The index is computed on the bag of bigrams from the two strings.

If one has a set of pairs that are similar, one can look for common bag differences in the matches. These can correspond to extra characters inserted in the identifier names for that particular program (for example, prefixes or suffixes related to module names or variable types). Once these are found, they can be used to tweak the similarity score.
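A bag (multiset) version of the index over character bigrams might look like this — a sketch of the general technique, not the commenter's code, and the example strings are hypothetical:

```python
from collections import Counter

def bigram_bag(s):
    # Multiset of adjacent character pairs.
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def bag_jaccard(a, b):
    # Multiset Jaccard: intersection takes min counts per key,
    # union takes max counts per key (Counter's & and | do exactly this).
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 1.0

spec_term = "velocity"
identifier = "m_velocity_out"  # hypothetical prefixed/suffixed identifier

print(bag_jaccard(bigram_bag(spec_term), bigram_bag(identifier)))
```

All 7 bigrams of "velocity" appear in the identifier, which has 13 bigrams in total, so the score is 7/13; the bag difference (`m_`, `_v`, `y_`, `_o`, `ou`, `ut`) is precisely the prefix/suffix noise the comment suggests detecting and discounting.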
b_mc2 · about 2 years ago
I've used this recently to do some fuzzy matching of column names in datasets. I also added it to a small Python one-liner library I've been making for practice. P.S. don't give me flak, I know this isn't an efficient way to do things.

    jaccard = lambda A, B: len(set(A).intersection(set(B))) / len(set(A).union(set(B)))

https://github.com/b-mc2/MiniMath
bitshiftfaced · about 2 years ago
One of the weaknesses of Jaccard similarity is how it focuses on matches/true positives. It neglects the importance of "negative space."

I was happy to see the Matthews correlation coefficient (MCC) used in the recent "1st and Future - Player Contact Detection" Kaggle competition. MCC balances the eight confusion matrix ratios, and I've gotten excellent results when using it in the past.
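The contrast is easy to see side by side: Jaccard (as a classification metric) never looks at true negatives, while MCC uses all four cells of the confusion matrix. A generic sketch with made-up counts, not the competition's metric code:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient: +1 perfect, 0 chance, -1 inverted.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom

def jaccard_from_counts(tp, fp, fn):
    # Jaccard over predictions-vs-truth ignores true negatives entirely.
    return tp / (tp + fp + fn)

# Hypothetical rare-event detector: lots of easy true negatives.
print(jaccard_from_counts(tp=5, fp=5, fn=10))   # 5 / 20 = 0.25
print(mcc(tp=5, tn=980, fp=5, fn=10))
```

Notice that adding a million more true negatives would leave the Jaccard score untouched but would change MCC, which is exactly the "negative space" point.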
ketralnis · about 2 years ago
Quoting myself from a while ago[0]:

At reddit many moons ago, before machine learning was a buzzword, one early iteration of recommendations was based on Jaccard distance using the number of co-voters between subreddits. But with one twist: divide by the size of the smaller subreddit.

    relatedness a b =
        numerator   = | voters on(a) ∩ voters on(b) |
        denominator = | voters on(a) ∪ voters on(b) |
        weight      = min(|voters on(a)|, |voters on(b)|)
        numerator / (weight * denominator)

That gives you a directional relatedness, that is programming->python but not necessarily python->programming. Used this way you account for the giant subreddit problem[1] automatically, but now the results are less "amitheasshole is related to askreddit" and more like "linguisticshumor is a more niche version of linguistics".

The great thing is that it's actually more actionable as far as recommendations go! Everybody has already heard of the bigger version of this subreddit, but they probably haven't heard of the smaller versions. And it's self-correcting: as a subreddit gets bigger we are less likely to recommend it, which is great because it needs our help less.

It's also easy to compute this because it lends itself to one giant SQL query that Postgres or even SQLite[2] optimises reasonably well. It has some discontinuities around very tiny subreddits, so there was also a hack to just exclude them with a rough heuristic. It does get fairly static, so once we've picked 3 subreddits to recommend if you're on subreddit A, if you don't like them we'll just keep showing them anyway. I had a hack in mind for that (use the computed values as random weights so we'll still occasionally show lower-scoring ones), but by this time people much smarter than I took over recommendations with more holistic solutions to the problem we were trying to solve in the first place.
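The weighted formula quoted above translates to a few lines over co-voter sets. This is a sketch with made-up voter ids, not reddit's actual pipeline; note it implements the formula exactly as written, where the `min(...)` weight boosts pairs in which the smaller community overlaps heavily with the larger one:

```python
def relatedness(voters_a, voters_b):
    # Jaccard-style overlap, additionally divided by the size of the
    # smaller community so niche subreddits score differently than
    # giants with the same raw overlap.
    numerator = len(voters_a & voters_b)
    denominator = len(voters_a | voters_b)
    weight = min(len(voters_a), len(voters_b))
    if denominator == 0 or weight == 0:
        return 0.0
    return numerator / (weight * denominator)

# Hypothetical voter ids: a big subreddit and a niche one inside it.
programming = set(range(1000))
python = set(range(50))  # all 50 python voters also vote on programming

print(relatedness(python, programming))
```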
Still, as a first pass it worked great, and based on my experience I'd recommend simple approaches like this before you break out the linear algebra.

Side note: I tried co-commenters in addition to co-voters. The results tended to be more accurate in my spot tests, but the difference fell away in more proper cross-validation testing and I didn't look into where the qualitative difference was. But since there are more votes than comments on small subreddits, the number of recommendable subreddits was higher with votes. I reasoned that co-submitters (of posts) should be even more accurate, but it was thrown off by a small number of spammers and I didn't want to mess with combining those tasks at the time.

[0]: https://news.ycombinator.com/item?id=22178517

[1]: that votes are distributed according to a power law, meaning that everybody has voted on the largest subreddits, so most clustering approaches recommend askreddit to everybody. That's okay for product recommendations where "you should buy the most popular CPU, it's most popular for a reason", but for subreddits you already know that, so we want a way to bias toward the most "surprising" of your votes.

[2]: I prototyped it on SQLite on my laptop and even with close to the production amount of data it ran reasonably well. Not fast, but fine. This was on considerably less traffic than today, mind.
samscully · about 2 years ago
The book Mining of Massive Datasets [1] has useful information on building an efficient similarity index using Jaccard/minhash. I would also recommend Otmar Ertl's papers on extensions of minhash that approximate Jaccard better in certain situations, e.g. SuperMinHash [2].

[1] http://www.mmds.org/ (Chapter 3)

[2] https://arxiv.org/abs/1706.05698
hbcondo714 · about 2 years ago
I think a good example of the Jaccard Index is from the paper Lazy Prices [1], in which the authors use it to measure diffs in a company's annual and quarterly reports. They also use other similarity measures (cosine, minimum edit, simple) so you get a better understanding of how each applies to textual similarity as a whole.

[1] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1658471
coeneedell · about 2 years ago
I recently used Jaccard similarity as a measurement of distance between two sets of online articles. It’s amazing how versatile it is for all sorts of weird tasks.
startup_eng · about 2 years ago
I just used this at work the other day to calculate similarities between different data models that had overlapping children models. One of our teams was going to go through manually to check these overlaps and consolidate, but by using this clustering algo based on Jaccard distance we were able to give them clusters to consolidate up front. Super cool stuff!
hughw · about 2 years ago
Looks like a "reflection coefficient for sets."

The reflection coefficient in electrical or acoustic (or elastic) transmission across two media is the difference of their impedances over the sum of them. Difference over sum is a pattern you see a lot.
pallas_athena · about 2 years ago
I used this for a project re: similarity between two strings. The Jaccard similarity between sets of uni- and bi-grams was a surprisingly effective metric.

DOG -> {d, o, g, do, og}

GOD -> {g, o, d, go, od}

intersection = {d, g, o}

union = {d, g, o, do, go, od, og}

J = 3/7 ≈ 43%
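That worked example, as a short sketch (lowercase strings, same uni- plus bi-gram construction):

```python
def grams(s):
    # Set of unigrams plus bigrams of a string.
    unigrams = set(s)
    bigrams = {s[i:i + 2] for i in range(len(s) - 1)}
    return unigrams | bigrams

def jaccard(a, b):
    return len(a & b) / len(a | b)

dog, god = grams("dog"), grams("god")
print(sorted(dog))        # ['d', 'do', 'g', 'o', 'og']
print(jaccard(dog, god))  # 3/7 ≈ 0.4286
```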
john-titor · about 2 years ago
This is well known to people working in cheminformatics as the Tanimoto similarity, used to calculate the similarity of two chemical structures based on their (folded) fingerprints.
SubiculumCode · about 2 years ago
Jaccard my Dice please.
unethical_ban · about 2 years ago
What is the predicted bounding and the ground truth bounding, as related by a stop sign? I have no idea what's happening there.
seydor · about 2 years ago
Didn't know there was a name for it.