I recently had to solve an entity resolution problem at my workplace, and here is how I went about it.<p>Problem Statement:<p><pre><code> - Find the right entity for a query among ~50m entities.
- Queries may contain only some of the entity attributes (if entities have n attributes, a query can contain anywhere from 1 to n of them).
- Queries can express attributes in various ways (partial information, typing errors, abbreviations, extra information, etc.).
</code></pre>
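To make the query variations above concrete, here is a minimal sketch that derives noisy queries from one entity record. The attribute names, the probabilities, and the `make_noisy_query` helper are all hypothetical, picked only for illustration; they are not the actual rules used at work.

```python
import random

def make_noisy_query(entity, rng):
    """Build one synthetic query from an entity's attributes (hypothetical scheme)."""
    # Partial information: keep a random non-empty subset of the attributes.
    keys = rng.sample(sorted(entity), k=rng.randint(1, len(entity)))
    parts = []
    for key in keys:
        value = entity[key]
        roll = rng.random()
        if roll < 0.3 and len(value) > 3:
            # Typing error: swap two adjacent characters.
            i = rng.randrange(len(value) - 1)
            value = value[:i] + value[i + 1] + value[i] + value[i + 2:]
        elif roll < 0.5 and " " in value:
            # Abbreviation: reduce a multi-word value to its initials.
            value = "".join(word[0] for word in value.split())
        parts.append(value)
    if rng.random() < 0.2:
        # Extra information: append a token that is not part of the entity.
        parts.append("urgent")
    return " ".join(parts)

# Made-up entity, purely for demonstration.
entity = {"name": "Acme Widget Corporation", "city": "Springfield", "country": "US"}
rng = random.Random(0)
queries = [make_noisy_query(entity, rng) for _ in range(5)]
```

Each generated query paired with its source entity can serve as a positive example, which is roughly the role such augmentations play in contrastive training.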
Existing Solution:<p><pre><code> - Elasticsearch-based matching using complicated heuristics that were overfit to a small training set. It got worse over time as the number of entities increased; top-20 retrieval accuracy was around 40% at the current number of entities.
</code></pre>
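For reference, top-k retrieval accuracy as used here can be computed roughly like this. This is a sketch under the assumption that each query has exactly one correct entity; the function name and the toy ids are hypothetical.

```python
def top_k_accuracy(ranked_ids, true_ids, k=20):
    """Fraction of queries whose true entity appears in the first k results."""
    if not true_ids:
        return 0.0
    hits = sum(
        1 for ranked, truth in zip(ranked_ids, true_ids) if truth in ranked[:k]
    )
    return hits / len(true_ids)

# Toy data: three queries, each with a ranked list of candidate entity ids.
ranked = [["e7", "e2", "e9"], ["e1", "e4", "e3"], ["e5", "e6", "e8"]]
truth = ["e2", "e3", "e0"]
acc_at_2 = top_k_accuracy(ranked, truth, k=2)  # only the first query hits
acc_at_3 = top_k_accuracy(ranked, truth, k=3)  # first and second queries hit
```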
Implemented Solution:<p><pre><code> - Embedding search using a sentence embedding model (pretrained DeBERTa, fine-tuned for this problem) trained via contrastive learning. Positive pairs are generated by applying augmentations to each attribute, chosen to mimic real queries as closely as possible (after going through many user queries).
- Top-20 accuracy was around 98%. The right entity is then picked from those candidates using heuristics and other business logic, together with a confidence measure (hyperparameters tuned on the validation set). With the final pipeline, high-confidence top-1 results reached around 99.995% precision and 86% recall.
</code></pre>
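The confidence gating step can be sketched as picking a threshold on the validation set. This is a simplified illustration, not the actual pipeline: it assumes each query yields one top-1 candidate with a confidence score, and it defines recall as correct accepted answers over all queries. The function names and numbers are hypothetical.

```python
def precision_recall_at_threshold(predictions, threshold):
    """predictions: list of (confidence, is_correct) for each query's top-1 candidate."""
    accepted = [correct for conf, correct in predictions if conf >= threshold]
    true_pos = sum(accepted)
    # With nothing accepted there are no wrong answers, so treat precision as 1.0.
    precision = true_pos / len(accepted) if accepted else 1.0
    recall = true_pos / len(predictions) if predictions else 0.0
    return precision, recall

def pick_threshold(predictions, target_precision):
    """Lowest threshold meeting the precision target (recall only drops as it rises)."""
    for threshold in sorted({conf for conf, _ in predictions}):
        precision, recall = precision_recall_at_threshold(predictions, threshold)
        if precision >= target_precision:
            return threshold, precision, recall
    return None  # no threshold reaches the target on this set

# Made-up validation results: (confidence, whether top-1 was the right entity).
val = [(0.99, True), (0.95, True), (0.90, False),
       (0.85, True), (0.60, False), (0.40, True)]
result = pick_threshold(val, target_precision=0.99)
```

Queries that fall below the chosen threshold are simply left unanswered, which is where the gap between precision and recall comes from.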
We ended up going with Pinecone for the embedding search, and the search latency was around 100ms (top 50 among ~50m embeddings).
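To show what the embedding search computes, here is an exact brute-force top-k search by cosine similarity in plain Python. This is not Pinecone's API, only a stand-in for the idea; at ~50m embeddings an exact scan like this would be far too slow, which is why a dedicated vector index is used. The toy vectors and names are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=50):
    """index: dict of entity_id -> embedding. Exact search, O(N) per query."""
    scored = ((cosine(query_vec, vec), entity_id) for entity_id, vec in index.items())
    # nlargest returns (score, entity_id) pairs sorted by descending score.
    return heapq.nlargest(k, scored)

# Toy 2-dimensional embeddings, purely for demonstration.
index = {"e1": [1.0, 0.0], "e2": [0.7, 0.7], "e3": [0.0, 1.0]}
results = top_k([0.9, 0.1], index, k=2)
```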