Hey HN Crew!<p>We all have lists...and they can be annoying to de-duplicate.<p>* User feedback
* Groceries
* Employee Surveys
* Bug reports
* You name it<p>Most ways to consolidate similar items work off of keywords or, worse, exact phrases (Sheets/Excel).<p>But LLMs are much better at understanding an item's semantic meaning and determining whether two items should be combined.<p>I decided to build my first Python package, The Semantic Deduplicator, to help me consolidate items based on their meaning, not keywords.<p>For example, on groceries:
['We need more berries', 'I want more more milk', 'Can we get more carbonated water please?', 'We need more sparkling water']
...deduplicated...
['Berries', 'Milk', 'Sparkling Water']<p>How it works:<p>1. Start with an empty list, ready to populate<p>2. The first item you add gets 1) transformed into a clean name (user feedback > product request) and 2) added to the list<p>3. For each additional item you add:<p>* Check whether the new item's embedding is close to any existing item's<p>* If it is, ask the LLM to compare the two items and decide whether they should be combined<p>* If the LLM says yes, combine them<p>This package is more of an exploration and proof of concept, so be careful with it. I'd love to hear any feedback.<p>All the links:<p>* YT Explainer Video: https://www.youtube.com/watch?v=etLsNgkGbeM<p>* Twitter Thread: https://twitter.com/GregKamradt/status/1719760658936545336<p>* PyPI: https://pypi.org/project/semantic-deduplicator/<p>* GitHub: https://github.com/gkamradt/SemanticDeduplicator
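<p>For anyone who wants to see the "How it works" loop in code, here is a rough, self-contained sketch. It is not the package's actual API: `dedupe`, `toy_embed`, `toy_clean_name`, `toy_should_merge`, and the `threshold` value are all hypothetical names made up for illustration. The toy functions stand in for a real embedding model and real LLM calls so the example runs offline. It also assumes the embedding is computed on the cleaned name and that "combining" just means keeping the first-seen entry.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_clean_name(text: str) -> str:
    """Stand-in for the LLM naming step: drop filler words, title-case the rest."""
    filler = {"we", "need", "more", "i", "want", "can", "get", "please"}
    words = [w.strip("?.,!").lower() for w in text.split()]
    return " ".join(w for w in words if w and w not in filler).title()


def toy_should_merge(a: str, b: str) -> bool:
    """Stand-in for the LLM comparison: a tiny hard-coded synonym table."""
    synonyms = {"carbonated": "sparkling"}

    def normalize(s: str) -> str:
        return " ".join(synonyms.get(w, w) for w in s.lower().split())

    return normalize(a) == normalize(b)


def dedupe(
    items: List[str],
    clean_name: Callable[[str], str],
    should_merge: Callable[[str, str], bool],
    threshold: float = 0.4,  # assumed cutoff, chosen only for this toy example
) -> List[str]:
    kept: List[Tuple[str, Counter]] = []  # (clean name, embedding) pairs
    for item in items:
        name = clean_name(item)
        vec = toy_embed(name)
        # Cheap embedding check first; only close candidates reach the judge.
        is_duplicate = any(
            cosine(vec, kept_vec) >= threshold and should_merge(name, kept_name)
            for kept_name, kept_vec in kept
        )
        if not is_duplicate:
            kept.append((name, vec))
    return [name for name, _ in kept]


groceries = [
    "We need more berries",
    "I want more more milk",
    "Can we get more carbonated water please?",
    "We need more sparkling water",
]
print(dedupe(groceries, toy_clean_name, toy_should_merge))
```

Because this sketch keeps the first-seen name when two items merge, it prints `['Berries', 'Milk', 'Carbonated Water']` rather than picking 'Sparkling Water'. The point of the two-stage check is that the embedding comparison is cheap and filters out most pairs, so the slower LLM comparison only runs on likely matches.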
This is smart and solid work.<p>We had the same idea and made it a core product feature - <a href="https://docs.arguflow.ai/duplicate_detection" rel="nofollow noreferrer">https://docs.arguflow.ai/duplicate_detection</a>