3 comments

kartoolOz, over 2 years ago

I recently had to solve the entity resolution problem at my workplace, and here is how I went about it.

Problem statement:

 - Find the right entity for a query among ~50M entities.
 - A query may contain only some of the entity's attributes (if entities have n attributes, a query can mention anywhere from 1 to n of them).
 - Queries can mention attributes in various ways (partial information, typing errors, abbreviations, extra information, etc.).

Existing solution:

 - Elasticsearch-based matching, using complicated heuristics overfit on a small training set. It gets worse over time as the number of entities increases; top-20 retrieval accuracy was around 40% at the current number of entities.

Implemented solution:

 - Embedding search using a sentence embedding model (pretrained DeBERTa, fine-tuned for this problem) trained via contrastive learning, where positive pairs are generated using augmentations for each attribute that best mimic the queries (chosen after going through many user queries).
 - Top-20 accuracy was around 98%. Picking the right entity out of those candidates was done with heuristics and other business logic, plus a confidence measure (hyperparameters tuned on a validation set). With the final pipeline, high-confidence top-1 precision reached around 99.995% at 86% recall.

We ended up going with Pinecone for the embedding search, and search latency was around 100 ms (top 50 among ~50M embeddings).
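To make the augmentation idea concrete, here is a minimal sketch of how positive pairs for contrastive training might be generated. It is not the commenter's actual code: the attribute names, the abbreviation table, the 30% typo rate, and every function name are assumptions made for illustration. Each synthetic query drops some attributes (partial information), abbreviates known words, and sometimes swaps adjacent characters (typing errors); it is paired with the full serialized entity as the anchor.

```python
import random

# Hypothetical abbreviation table; a real one would be mined from user queries.
ABBREVIATIONS = {"corporation": "corp", "limited": "ltd", "street": "st"}


def drop_attributes(entity, rng):
    """Partial information: keep a random non-empty subset of attributes."""
    keys = list(entity)
    kept = rng.sample(keys, rng.randint(1, len(keys)))
    # Preserve the original attribute order.
    return {k: entity[k] for k in keys if k in kept}


def add_typo(text, rng):
    """Typing error: swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def abbreviate(text):
    """Abbreviation: replace known long forms with their short forms."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())


def serialize(entity):
    """Flatten an entity record into the text fed to the embedding model."""
    return " ".join(entity.values())


def make_query(entity, rng, typo_rate=0.3):
    """Build one synthetic query string from an entity record."""
    parts = []
    for value in drop_attributes(entity, rng).values():
        value = abbreviate(value)
        if rng.random() < typo_rate:  # assumed rate, not from the source
            value = add_typo(value, rng)
        parts.append(value)
    return " ".join(parts)


def positive_pairs(entity, n, seed=0):
    """Return n (synthetic_query, anchor) pairs for one entity."""
    rng = random.Random(seed)
    anchor = serialize(entity)
    return [(make_query(entity, rng), anchor) for _ in range(n)]


if __name__ == "__main__":
    example = {
        "name": "Acme Corporation Limited",
        "city": "Springfield",
        "street": "12 Main Street",
    }
    for query, anchor in positive_pairs(example, 3, seed=1):
        print(repr(query), "->", repr(anchor))
```

Pairs like these would then go to a contrastive loss, with other entities in the batch serving as negatives. The "extra information" case from the problem statement is left out here; it could be handled by appending unrelated tokens to the query.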

shaqbert, over 2 years ago

The main impediment to companies adopting entity resolution tech is the incentive structure. Companies want to show growing user numbers, transactions, leads, orders, etc. Alas, if you look closely and sift out the dupes and frauds, your growth looks a lot less impressive. So why look closely?

maxdemarzi, over 2 years ago

How are you storing and querying your graph data? Are you using a graph database underneath all that, or something else?

Entity Resolution: Reflections on the most common data science challenge
