I made an app to fuzzy-deduplicate my Google Sheets and CRM records

- No manual configuration required

- Works out of the box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5-family model

- Performs similarity search using DuckDB with its vector similarity search (vss) extension

- Does last-mile comparison and merges duplicates using Claude

Demo video: https://youtu.be/7mZ0kdwXBwM

GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Background story: My company has a table for tracking leads, which includes website visitors, demo form submissions, app signups, and manual entries. It's full of duplicates, and writing formulas to merge those dupes has been a massive PITA.

I figured that an LLM could handle any data shape and give me a way to deal with tricky custom rules like “treat international subsidiaries as distinct from their parent company”.

The challenging part was avoiding an N×N comparison matrix (comparing every record against every other record). The solution I came up with was to first narrow the search space using vector embeddings + semantic similarity search, and then use a generative LLM only to compare each record against a few nearest neighbors and merge. Rough sketches of both steps are below.

Some cool attributes of this approach:

- Works incrementally (no reprocessing the entire dataset)

- Allows processing all records in parallel

- Composes with deterministic dedupe rules

Let me know if you have feedback on how to make this better!
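To make the first part concrete, here's a minimal sketch of the embed + nearest-neighbor step in Python. The specific model (intfloat/e5-small-v2), the table schema, and the sample records are my illustrative assumptions, not necessarily what the repo uses:

    # Sketch: embed records with an E5-family model, then use DuckDB's vss
    # extension to fetch a few nearest neighbors per record instead of
    # building an N x N comparison matrix.
    import duckdb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("intfloat/e5-small-v2")  # 384-dim E5 model (assumed)

    records = [
        (1, "Acme Corp", "sales@acme.com"),
        (2, "ACME Corporation", "sales@acme.com"),
        (3, "Globex Inc", "info@globex.com"),
    ]

    # E5 models are trained with a "query:" / "passage:" prefix on inputs.
    texts = [f"passage: {name} {email}" for _, name, email in records]
    vecs = model.encode(texts, normalize_embeddings=True)

    con = duckdb.connect()
    con.execute("INSTALL vss; LOAD vss;")
    con.execute("CREATE TABLE leads (id INT, name TEXT, email TEXT, vec FLOAT[384])")
    con.executemany(
        "INSERT INTO leads VALUES (?, ?, ?, ?)",
        [(i, n, e, v.tolist()) for (i, n, e), v in zip(records, vecs)],
    )
    # Optional HNSW index so top-k lookups stay fast as the table grows.
    con.execute("CREATE INDEX leads_hnsw ON leads USING HNSW (vec)")

    # Candidate duplicates for record 1: its k nearest neighbors, not all N-1
    # rows. Embeddings are normalized, so L2 distance ranks like cosine.
    candidates = con.execute(
        """
        SELECT id, name, email
        FROM leads
        WHERE id != 1
        ORDER BY array_distance(vec, ?::FLOAT[384])
        LIMIT 5
        """,
        [vecs[0].tolist()],
    ).fetchall()
    print(candidates)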
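And a sketch of the last-mile step: handing each candidate pair to Claude for a yes/no judgment. The model name and prompt here are assumptions for illustration (the real prompt presumably also drives the merge and encodes custom rules):

    # Sketch: ask Claude whether a candidate pair found by the vector search
    # is actually the same entity. Model name and prompt are illustrative
    # assumptions, not the repo's actual prompt.
    import json

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def is_duplicate(record_a: dict, record_b: dict) -> bool:
        """Return True if the model judges the two records to be the same entity."""
        resp = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumed; any current Claude model
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    "Do these two CRM records refer to the same real-world entity? "
                    "Treat international subsidiaries as distinct from their parent "
                    "company. Answer YES or NO only.\n"
                    f"A: {json.dumps(record_a)}\nB: {json.dumps(record_b)}"
                ),
            }],
        )
        return resp.content[0].text.strip().upper().startswith("YES")

    print(is_duplicate(
        {"name": "Acme Corp", "email": "sales@acme.com"},
        {"name": "ACME Corporation", "email": "sales@acme.com"},
    ))

Since each record only needs one neighbor lookup plus a handful of pairwise LLM calls, records can be processed in parallel, and a newly added row just gets embedded, searched, and checked against its neighbors without reprocessing the dataset.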