
Ask HN: Apply machine learning to fuzzy matching?

1 point by TXV, about 8 years ago
I would like to receive some input from the wise guys here on this topic.

So let's say that I have an MDM (Master Data Management) system whose primary application is to detect and prevent duplicate records.

Every time a sales rep enters a new customer in the system, my platform runs a check against existing records, computes the Levenshtein or Jaccard or XYZ distance between pairs of words, phrases, or attributes, applies weights and coefficients, and outputs a similarity score.

Your typical fuzzy matching scenario.

I would like to know if it makes sense at all to apply machine learning techniques to optimize the matching output, i.e. find duplicates with maximum accuracy. And where exactly it makes the most sense:

- optimizing the weights of the attributes?

- increasing the algorithm's confidence by predicting the outcome of the match?

- learning the matching rules that I would otherwise configure into the algorithm?

- something else?

Also, my understanding is that weighted fuzzy matching is already a good enough solution, probably even from a financial perspective, since whenever you deploy such an MDM system you have to do some analysis and preprocessing anyway, whether that means manually encoding the matching rules or training an ML algorithm.

So I'm not sure that adding ML would represent a significant value proposition.

Any thoughts are appreciated.
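To make the first option (optimizing the attribute weights) concrete, here is a minimal sketch in plain Python. It is only one possible reading of the setup described above, not the poster's actual system: the record schema (`name`, `city`), the helper names, the toy labeled pairs, and the choice of logistic regression as the learner are all assumptions made for illustration. The idea is to compute one similarity per attribute, then either combine them with hand-set weights or let a small model learn the weights from pairs already labeled duplicate or not.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def features(rec_a: dict, rec_b: dict) -> list:
    """One similarity per attribute (hypothetical schema: name, city)."""
    return [
        edit_similarity(rec_a["name"].lower(), rec_b["name"].lower()),
        jaccard(rec_a["city"], rec_b["city"]),
    ]


def weighted_score(feats, weights) -> float:
    """Hand-tuned baseline: weighted average of attribute similarities."""
    return sum(w * f for w, f in zip(weights, feats)) / sum(weights)


def predict_proba(feats, w, b) -> float:
    """Logistic model: probability that a pair is a duplicate."""
    z = b + sum(wj * xj for wj, xj in zip(w, feats))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Learn attribute weights by stochastic gradient descent on log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


# Toy labeled pairs (made up): [name similarity, city similarity] -> duplicate?
X = [[0.95, 1.0], [0.85, 1.0], [0.2, 1.0], [0.1, 0.0]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
```

After training, `predict_proba(features(new_record, existing_record), w, b)` plays the role of the similarity score, with the weights fitted to labeled examples rather than configured by hand. Whether that beats a hand-tuned `weighted_score` depends on how many reliably labeled pairs are available.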

1 comment

papaf, about 8 years ago
I have no experience with fuzzy matching, but it sounds a lot like finding nearest neighbours to me. Techniques used for nearest-neighbour search might also help with your problem:

Multidimensional scaling is a nice way to visualise distances between objects: https://en.wikipedia.org/wiki/Multidimensional_scaling

A review of nearest-neighbour techniques might be interesting: https://arxiv.org/abs/1007.0085

Also, if you stick with just a similarity score, something like a ROC curve can help to pick a cutoff: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
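As a rough illustration of the ROC suggestion, the sketch below sweeps every observed similarity score as a candidate cutoff and records the false-positive and true-positive rates at each one. The scores and labels are made up, and picking the cutoff that maximises Youden's J (TPR minus FPR) is just one common criterion assumed here; in a deduplication setting you may prefer to weight false merges more heavily than missed duplicates.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) per distinct score, highest threshold first.

    A pair is predicted "duplicate" when its score is >= threshold.
    Assumes labels contain at least one 1 and at least one 0.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def best_cutoff(scores, labels):
    """Threshold maximising Youden's J = TPR - FPR (one possible criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]


# Made-up similarity scores with manual duplicate (1) / distinct (0) labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 1, 0, 1, 0, 0]
cutoff = best_cutoff(scores, labels)
```

Plotting the `(fpr, tpr)` pairs from `roc_points` gives the ROC curve itself, which makes the trade-off at each candidate cutoff visible before committing to one.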