Entity Resolution: Reflections on the most common data science challenge

60 points by Major_Grooves, over 2 years ago

3 comments

kartoolOz, over 2 years ago
I recently had to solve the entity resolution problem at my workplace, and here is how I went about it.

Problem statement:

  - Find the right entity for a query among ~50m entities.
  - Queries could contain only a few of the entity attributes (if entities have n attributes, a query can include anywhere from 1 to n of them).
  - Queries can mention attributes in various ways (partial information, typing errors, abbreviations, extra information, etc.).

Existing solution:

  - Elasticsearch-based matching, using complicated heuristics overfit on a small training set. It gets worse over time as the number of entities increases; the top-20 search retrieval accuracy was around 40% on the current number of entities.

Implemented solution:

  - Embedding search using a sentence embedding model (pretrained DeBERTa fine-tuned for the current problem) trained via contrastive learning, where positive pairs are generated using augmentations for each attribute that best mimic the queries (chosen after going through many user queries).
  - Top-20 accuracy was around 98%. Picking out the right entity was done through heuristics and other business logic with a proper confidence measure (hyperparameters tuned on the validation set). After the final pipeline, we could get the high-confidence top-1 accuracy to around 99.995% (precision) and 86% (recall).

We ended up going with Pinecone for the embedding search, and the search latency was around 100 ms (top 50 among ~50m embeddings).
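
For readers wondering what "augmentations for each attribute" might look like in practice, here is a minimal sketch. The comment does not say which augmentations were actually used, so everything here is an assumption: the attribute names, the `ABBREVIATIONS` table, the helper names, and the probabilities are all hypothetical. It only shows how one entity record could be turned into an (anchor, noisy query) positive pair that imitates partial information, typing errors, and abbreviations; such pairs would then be fed to a contrastive loss during fine-tuning.

```python
import random

# Hypothetical abbreviation table; a real one would come from studying user queries.
ABBREVIATIONS = {"street": "st", "corporation": "corp", "limited": "ltd"}


def drop_attributes(entity, rng):
    """Keep a random non-empty subset of attributes (a query may give 1..n of them)."""
    keys = list(entity)
    kept = set(rng.sample(keys, rng.randint(1, len(keys))))
    return {key: entity[key] for key in keys if key in kept}


def add_typo(text, rng):
    """Swap two adjacent characters to imitate a typing error."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def abbreviate(text):
    """Replace known long forms with their abbreviations, word by word."""
    return " ".join(ABBREVIATIONS.get(word.lower(), word) for word in text.split())


def truncate(text, rng):
    """Keep a prefix of the words to imitate partial information."""
    words = text.split()
    if not words:
        return text
    return " ".join(words[: rng.randint(1, len(words))])


def make_positive_pair(entity, rng):
    """Return (anchor_text, synthetic_query) for one entity record."""
    anchor = " ".join(str(value) for value in entity.values())
    parts = []
    for value in drop_attributes(entity, rng).values():
        text = abbreviate(str(value))
        choice = rng.random()
        if choice < 0.3:  # assumed probability of a typo
            text = add_typo(text, rng)
        elif choice < 0.6:  # assumed probability of truncation
            text = truncate(text, rng)
        parts.append(text)
    return anchor, " ".join(parts)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical entity record, for illustration only.
    entity = {"name": "Acme Corporation", "address": "12 Main Street", "city": "Springfield"}
    for _ in range(3):
        print(make_positive_pair(entity, rng))
```

Seeding the `random.Random` instance keeps the generated pairs reproducible, which helps when comparing training runs. Whether augmentations like these "best mimic the queries" can only be judged against real user queries, as the comment notes.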
shaqbert, over 2 years ago
The main impediment to companies adopting entity resolution tech is the incentive structure. Companies want to show growing numbers of users, transactions, leads, orders, etc. Alas, if you look closely and sift out the dupes/frauds, your growth looks a lot less impressive. So why look closely?
maxdemarzi, over 2 years ago
How are you storing and querying your graph data? Are you using a graph database underneath all that, or something else?