科技回声

I made an app to fuzzy-deduplicate my Google Sheets and CRM records.

- No manual configuration required
- Works out-of-the-box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5-family model
- Performs similarity search using DuckDB with a vector similarity extension
- Does last-mile comparison and merges duplicates using Claude

Demo video: <a href="https://youtu.be/7mZ0kdwXBwM" rel="nofollow">https://youtu.be/7mZ0kdwXBwM</a>

GitHub repo (Apache 2.0 licensed): <a href="https://github.com/SnowPilotOrg/dedupe_it">https://github.com/SnowPilotOrg/dedupe_it</a>

Background story: My company has a table for tracking leads, which includes website visitors, demo form submissions, app signups, and manual entries. It's full of duplicates, and writing formulas to merge those dupes has been a massive PITA.

I figured that an LLM could handle any data shape and give me a way to deal with tricky custom rules like "treat international subsidiaries as distinct from their parent company".

The challenging part was avoiding an NxN comparison matrix. The solution I came up with was to first narrow down the search space using vector embeddings + semantic similarity search, and then use a generative LLM only to compare a few nearest neighbors and merge them.

Some cool attributes of this approach:

- Can work incrementally (no reprocessing the entire dataset)
- Allows processing all records in parallel
- Composes with deterministic dedupe rules

Lmk any feedback on how to make this better!
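
The two-stage idea in the post (narrow the candidates with embeddings, then run the expensive comparison only on a few nearest neighbors) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `embed` is a toy hashed-trigram stand-in for the E5 model, the brute-force neighbor scan stands in for the DuckDB vector search, and `toy_judge` is a hypothetical placeholder for the Claude comparison step. All function names, `k`, and `min_sim` are assumptions made for the example.

```python
import math
import re
import zlib

DIM = 512  # toy embedding size (assumption; unrelated to E5 dimensions)

def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed character trigrams, L2-normalized."""
    padded = f"  {text.lower().strip()} "
    vec = [0.0] * DIM
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))

def candidate_pairs(records, k=2, min_sim=0.3):
    """Stage 1: keep only each record's k nearest neighbors as candidate pairs.

    The scan here is brute force; a real setup would use a vector index instead.
    """
    vectors = [embed(r) for r in records]
    pairs = set()
    for i, v in enumerate(vectors):
        scored = sorted(
            ((cosine(v, w), j) for j, w in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in scored[:k]:
            if sim >= min_sim:
                pairs.add((min(i, j), max(i, j)))
    return pairs

LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "ltd"}

def toy_judge(a: str, b: str) -> bool:
    """Hypothetical placeholder for the LLM comparison: match names minus legal suffixes."""
    def core(name):
        tokens = re.findall(r"[a-z0-9]+", name.lower())
        return [t for t in tokens if t not in LEGAL_SUFFIXES]
    return core(a) == core(b)

def dedupe(records, judge, k=2, min_sim=0.3):
    """Stage 2: run the judge on candidate pairs only, then merge with union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in sorted(candidate_pairs(records, k, min_sim)):
        if judge(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [
    "Acme Inc.", "ACME, Inc", "Acme Incorporated",
    "Globex Corporation", "Globex Corp", "Initech",
]
clusters = dedupe(records, toy_judge)
print(clusters)  # lists of record indices that were judged to be the same entity
```

The judge is passed in as a plain callable, so a deterministic rule and an LLM call could be swapped or chained without touching the candidate-generation step. Note that the toy neighbor scan is still quadratic in cheap vector comparisons; what the approach saves is the number of expensive pairwise judgments.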

2 comments

K0IN 6 months ago

This is very interesting. I was building something similar, but I used <a href="https://github.com/K0IN/string-embed">https://github.com/K0IN/string-embed</a> (embeddings based on a distance function, Levenshtein in my case) to generate embeddings for deterministic matching.

I will follow your project. I'm interested in your ANN search speeds :)
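
For readers unfamiliar with the distance function mentioned in this comment, here is a minimal sketch of plain Levenshtein (edit) distance. It is only the reference metric, not code from string-embed; how that project turns the metric into embeddings is not described in the thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```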

DigiFreeze 6 months ago

high-key useful! Are you thinking of making a Google Sheets extension? How are you thinking about data privacy? Any plans to make a local-only app?

Show HN: Fuzzy deduplicate any CSV using vector embeddings
