
Entity Resolution: The most common data science challenge

85 points by jonpon, almost 3 years ago

8 comments

smeeth, almost 3 years ago
Danger awaits all ye who enter this tutorial and have large datasets.

The tutorial is fun marketing material and all, but it's FAR too slow to be used on anything at scale. Please, for your sanity, don't treat this as anything other than a fun toy example.

ER wants to be an O(n^2) problem and you have to fight very hard for it not to turn into one. Most people doing this at scale are following basically the same playbook:

1) Use a very complicated, speed-optimized, non-ML algorithm to find groups of entities that are highly likely to be the same, usually based on string similarity, hashing, or extremely complicated heuristics. This process is called blocking or filtering.

2) Use fancy ML to determine matches within these blocks.

If you try to combine 1 & 2, skip 1, or try to do 1 with ML on large datasets, you are guaranteed to have a bad time. The difference between mediocre and amazing ER is how well you are doing blocking/filtering.
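A minimal sketch of the blocking step this comment describes (the key function, the 4-character prefix rule, and the sample records are illustrative assumptions, not from the comment): group records under a cheap blocking key, then generate candidate pairs only within each block, so most of the O(n^2) comparisons are never made.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(name: str) -> str:
    # Crude illustrative key: lowercase, keep letters only, take first 4 chars.
    s = "".join(c for c in name.lower() if c.isalpha())
    return s[:4]

def candidate_pairs(records):
    # Bucket record indices by blocking key.
    blocks = defaultdict(list)
    for i, name in enumerate(records):
        blocks[blocking_key(name)].append(i)
    # Only compare records that share a block.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

records = ["Acme Corp", "ACME Corporation", "Globex Inc", "acme corp."]
print(candidate_pairs(records))  # 3 candidate pairs instead of the full 6
```

The expensive ML matcher from step 2 then runs only on these candidate pairs; how much the pair count shrinks relative to n(n-1)/2 is exactly the blocking quality the comment is talking about.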
ropeladder, almost 3 years ago
Entity resolution/record linkage/deduplication is an oddly specialized domain of knowledge given that it's such a common problem. I put together a page of resources a while back if anyone is interested: https://github.com/ropeladder/record-linkage-resources
stevesimmons, almost 3 years ago
A good starting point for Entity Resolution/Deduplication is the Python Dedupe project [1, 2] and the PhD thesis on whose work it is based [3].

[1] https://github.com/dedupeio/dedupe

[2] https://dedupe.io/

[3] http://www.cs.utexas.edu/~ml/papers/marlin-dissertation-06.pdf
visarga, almost 3 years ago
I was expecting the use of sentence-transformers (sbert.net). If you have a long list of entities you could use an approximate similarity search library such as Annoy. The authors store the embeddings in a database and decode JSON for each comparison. Very inefficient in my opinion. At least load the whole table of embeddings into a np.array from the start; np.dot is plenty fast if your list is not huge.

The problem is still not solved. Having a list of the most similar entities does not tell you which are the same and which are just related. You need a classifier. For that you can label a few hundred pairs of positive and negative examples and use the same sbert.net to fine-tune a transformer. The authors take the easier route of thresholding the cosine similarity score at 0.8, but this threshold might not work for your case.
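A sketch of the "load everything into a np.array and use np.dot" approach this comment suggests. The embedding vectors below are toy stand-ins (real ones would come from a sentence-transformers model); the 0.8 threshold is the one the article reportedly uses, and, as the comment warns, it is dataset-dependent.

```python
import numpy as np

# Toy stand-in for precomputed sentence embeddings: one row per entity.
emb = np.array([
    [1.0, 0.0, 0.0],   # "Acme Corp"        (hypothetical)
    [0.9, 0.1, 0.0],   # "ACME Corporation" (hypothetical)
    [0.0, 1.0, 0.0],   # "Globex Inc"       (hypothetical)
])

# L2-normalize rows so a dot product equals cosine similarity.
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

sims = emb @ emb.T           # all-pairs cosine similarity in one matmul
np.fill_diagonal(sims, 0.0)  # ignore self-matches

# Keep each pair once (upper triangle) and threshold at 0.8.
matches = np.argwhere(np.triu(sims > 0.8))
print(matches)
```

One matrix multiply replaces the per-comparison database read plus JSON decode the comment criticizes; for lists too large for a dense matmul, this is where an approximate index like Annoy would come in.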
mynegation, almost 3 years ago
I remember helping my little sister, who got an entity resolution (people’s names and company names) homework assignment for a programming class 26 years ago (she is an economics major and I am CS). It was infuriating and intellectually challenging at the same time. We came up with a combination of n-grams, Levenshtein distance, and canonicalization of common abbreviations (think “Inc.” and “Corp.”). It worked reasonably well.
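Two of the three ingredients in this comment can be sketched in a few lines (the abbreviation table, the distance cutoff, and the example names are illustrative assumptions, not from the comment): canonicalize common abbreviations first, then compare the canonical forms with Levenshtein distance.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREV = {"inc": "incorporated", "corp": "corporation", "co": "company"}

def canonicalize(name: str) -> str:
    # Lowercase, drop punctuation, expand known abbreviations.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREV.get(t, t) for t in tokens)

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def same_entity(a: str, b: str, max_dist: int = 2) -> bool:
    return levenshtein(canonicalize(a), canonicalize(b)) <= max_dist

print(same_entity("Acme Corp.", "ACME Corporation"))
```

The n-gram part of the comment's recipe would typically sit in front of this as a cheap filter (compare character n-gram overlap before paying for the full edit-distance computation).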
benmanns, almost 3 years ago
What are people doing with entity resolution/record linkage? At Doximity we use it to match messy physician, hospital, and medical publication data from various sources into more coherent data sets to power profiles and research tools. Mostly with https://dedupe.io/ but with some custom tooling to handle larger (1M+ entities) datasets.
fastaguy88, almost 3 years ago
Sorry to post on a topic I know nothing about.

To me, this looks very similar to local sequence similarity search (e.g. BLAST), where there are very rapid methods that use tuple lookup and banded alignment to quickly identify "homologs" (the same entity). The nice thing about similarity-search algorithms is that they give you a very accurate probability of whether two strings are "homologous" (belong to the same entity). Perhaps I have the scale wrong, but it is routine to look for thousands of queries (new entities) among hundreds of millions of sequences (known entities) in an hour or so (and sequences are typically an order of magnitude longer than entity names). The problem is embarrassingly parallel, and very efficient algorithms are available for entities that are usually very similar.
markjspivey, almost 3 years ago
This doesn't make any sense though; your write-up starts from the assumption that multiple records in the dataset are "obviously" the same entity ... so we wouldn't even need entity resolution then ...

"Entity resolution" as a process should really mean determining whether two records that aren't obviously the same entity actually are the same entity ... and how you would go about discovering and proving and declaring that ...