Show HN: Zingg – open-source entity resolution for single source of truth

18 points by sonalgoyal over 3 years ago
Hello HN,

I am Sonal, a data consultant from India. For the past few months (and years!), I have been working on an entity resolution tool to build a single source of truth for customers, suppliers, products and parts. Here is a short demo of Zingg in action: https://www.youtube.com/watch?v=zOabyZxN9b0

As a data consultant, I often struggled to build unified views of core entities on the datalake and the warehouse. Data spread across different systems has variations and inconsistencies, making Customer 360, KYC, AML, segmentation, personalization and other analytics difficult.

As I talked with different clients facing this issue, I searched for existing solutions which I could use or recommend. Unfortunately, most of them were very expensive MDM solutions like Tamr, or CDP solutions like Amperity. There were many open source libraries, but they did not tie well into the datalake/warehouse scenarios we were working with, did not scale, needed a decent bit of programming, or did not generalize. I even tried to build something internally and failed miserably, and that got me hooked :-)

As I dug deeper into the problem, I realized that there were multiple challenges. Data matching, at its very core, becomes a Cartesian join, as you need to compare every pair of records to figure out the matches. With millions of records, this becomes extremely tough to scale. I referred to various research papers and then implemented a blocking algorithm to overcome this. More details at https://docs.zingg.ai/docs/zModels.html

The second challenge was to say which pairs are a match. I wanted to have a machine learning-based approach to handle the different types of entities and the variety of differences in real world data. But I also felt that non-ML experts should be able to use Zingg easily, hence I took the approach of abstracting the feature generation and hyper-parameter tuning for the classifier.

Once I settled on the ML approach, the problem of training data quickly arose, which led me to pick up active learning and build an interactive labeler through which sample records can be marked as matches and non-matches to build training sets quickly. I still feel that we should have an unsupervised approach as well, but I have not yet figured out the right way to do so.

The Zingg repository is hosted at https://github.com/zinggAI/zingg and we have close to 60 members on our Slack (https://join.slack.com/t/zinggai/shared_invite/zt-w7zlcnol-vEuqU9m~Q56kLLUVxRgpOA). We are now two developers working full time on Zingg!!! I am super happy that early users have been able to use Zingg and push us to build more stuff: model documentation, using pre-existing training data, native Snowflake integration, etc.

I have been an open source consumer all my dev life, and this is the first time I have made a decent contribution. It is my first time trying to build a community as well. Not sure how the future will unfold, but I wanted to reach out to the community here and hear what you think about the problem, the approach, and any ideas or suggestions.

Thanks for reading along, and please do post your thoughts in the comments below.
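
For readers new to blocking, here is a minimal Python sketch of the general idea the post describes: group records under a cheap key and compare only records within the same group, instead of every pair. This is not Zingg's implementation. The `blocking_key` function, the `name` and `zip` fields, and the sample records are all hypothetical and chosen by hand for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical hand-picked key: first letter of the name plus the postal code."""
    return (record["name"][:1].lower(), record["zip"])


def candidate_pairs(records):
    """Yield index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    # Only records inside the same block are compared with each other.
    for members in blocks.values():
        yield from combinations(members, 2)


# Made-up sample data, for illustration only.
records = [
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jane Doe", "zip": "94105"},
    {"name": "J. Doe", "zip": "94105"},
    {"name": "Alice Wong", "zip": "10001"},
]

pairs = sorted(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} candidate pairs instead of {all_pairs}: {pairs}")
```

A fixed key like this one will miss matches whose key fields differ (a typo in the first letter, for example), so the choice of key trades recall against the number of comparisons. Only the pairs that survive blocking would then be passed to a match classifier.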

5 comments

rishsriv over 3 years ago
This looks pretty cool! Is this basically efficient/scalable fuzzy object matching?

IMO, it would be super useful to have some performance benchmarks – how fast is this for 1k/100k objects? How does that compare to other approaches etc

Not sure how feasible these are, but features I would find super useful:

- string matching across languages in different scripts (with something like unidecode maybe? [1])
- fuzzy matching that includes continuous variables like lat/long, age etc

Excited about using this – will be following the repo very closely!

[1] https://github.com/avian2/unidecode
bencastleton over 3 years ago
I'm excited about this. So needed to have an open source solution for this problem! Love it!
navinrathore over 3 years ago
Hi, this is Navin, I work with Sonal on Zingg and would love to hear your feedback on our work.
javedevux over 3 years ago
Looks cool. Congratulations
ruchiragarwal75 over 3 years ago
This looks good. Do you have some ready models to try?