科技回声

3 comments

dsalaj, over 2 years ago

I have been working on an entity matching solution for two years now, and I have decided to write down some of the lessons I picked up along the way. It turns out there are too many relevant details to cover in a single post, so I will cover the topic in multiple parts.

This first part is a high-level introduction, useful for the project planning and architecture decisions that need to be made early in the development process. Any feedback is welcome, as are requests for the follow-up parts if there is something specific you would like covered.


fzliu, over 2 years ago

I'm surprised to see that ML-based semantic search is barely touched on in this article. There's a strong focus on entity matching, but an arguably more powerful way to conduct similarity search is to leverage embedding vectors from trained models.

A great upside to this approach is that it works for a variety of different types of unstructured data (images, video, molecular structures, geospatial data, etc.), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you'll be able to do this at scale with very minimal effort.
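For anyone unfamiliar with the idea in this comment, here is a minimal sketch of embedding-based similarity search. It is only an illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for what a trained model would produce, the names `cosine_similarity`, `top_k`, and `index` are made up for this example, and the brute-force scan stands in for what a vector database would do with an approximate index. It is not the Milvus API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best ids."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical embeddings (assumption): real models emit hundreds of dimensions.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

nearest = top_k(query, index, k=2)
print(nearest)
```

The linear scan costs one comparison per stored vector, which is why a dedicated index becomes attractive once the collection grows large.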


dang, over 2 years ago

I would like to know if any of these techniques could be used for identifying articles that are either copies of each other, or near-copies, or different articles on the same story.



Similarity search and deduplication at scale
