The Science of Crawl, Part 1: Deduplication of Web Content

66 points by jisaacso, over 10 years ago

4 comments

stephen_mcd, over 10 years ago
Great article.

I went through a similar process about a year ago for https://kouio.com (an RSS reader). In its case I needed to coalesce closely matching RSS feeds purely for storage and performance. After trialling edit distance and various simhash implementations in Python, we ended up needing to look no further than the standard library's difflib.SequenceMatcher. I wish I had documented my findings at the time, but I recall it was the best in terms of speed and accuracy.

Also, you might not want to rely on str.isalnum for stripping punctuation. I made the same mistake here: https://twitter.com/stephen_mcd/status/506344236531212288
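
A minimal sketch of the difflib.SequenceMatcher approach mentioned above, assuming two feed bodies are compared as plain strings. The `similarity` and `is_near_duplicate` helpers and the 0.9 threshold are illustrative assumptions, not details from the comment.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    # autojunk=False turns off difflib's heuristic that treats very frequent
    # elements as junk once a sequence has 200 or more items.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Hypothetical helper: the threshold is an assumed, tunable value."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    first = "The quick brown fox jumps over the lazy dog"
    second = "The quick brown fox jumped over the lazy dog"
    print(round(similarity(first, second), 3))
    print(is_near_duplicate(first, second))
```

SequenceMatcher compares every pair directly, so this suits a modest number of candidates; it does not by itself provide an index for finding candidates in a large collection.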
thaumaturgy, over 10 years ago
There's also nilsimsa hashing (there's a Python implementation at http://code.google.com/p/py-nilsimsa/). Unfortunately, nilsimsa hashes can vary in their most significant bits when used on similar inputs:

    773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025
    47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645

...so although nilsimsa is somewhat nice for calculating the difference of two documents, it's a pain in the butt for finding similar documents in a database.

The solution described in the writeup is neat, but I really wish there were an LSH that generated hashes with most-to-least significance in their bits.

Great writeup though!
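
To make the point about the two digests concrete, here is a small sketch that compares them bit by bit and also checks how many leading hex characters they share. It uses only the two hex strings quoted above; `hamming_distance` and `common_prefix_len` are hypothetical helper names, and this is not the py-nilsimsa API.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the differing bits between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def common_prefix_len(hex_a: str, hex_b: str) -> int:
    """Number of leading hex characters the two digests share."""
    length = 0
    for char_a, char_b in zip(hex_a, hex_b):
        if char_a != char_b:
            break
        length += 1
    return length


# The two digests quoted in the comment.
A = "773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025"
B = "47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645"

if __name__ == "__main__":
    total_bits = 4 * len(A)  # each hex character encodes 4 bits
    print(hamming_distance(A, B), "of", total_bits, "bits differ")
    print("shared hex prefix:", common_prefix_len(A, B))
```

The digests already differ in their first hex character, so a sorted index or prefix lookup on the digest would not place them near each other, even if most of their bits agree.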
boynamedsue, over 10 years ago
As an aside: util.clean_html() has been dropped from NLTK 3.0, which has substantial API changes [1].

The recommendation now is to use BeautifulSoup or something similar.

[1] https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
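
For anyone porting, a dependency-free sketch of the "something similar" route using the standard library's html.parser. The `clean_html` name here is just a local stand-in for the removed NLTK helper, and its behaviour is an assumption, not a reimplementation of NLTK's. With BeautifulSoup installed, `BeautifulSoup(html, "html.parser").get_text()` covers the same ground.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with a space, then collapse runs of whitespace.
        # Trade-off: text split by inline tags mid-word gains a space.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html: str) -> str:
    """Hypothetical stand-in for the removed nltk.util.clean_html()."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(clean_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```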
shabinesh, over 10 years ago
Good article. I had the challenge of deduplicating addresses; I just used cosine similarity, which worked well for the purpose.
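
A rough sketch of cosine similarity applied to addresses, assuming each address is turned into a bag of lowercase alphanumeric tokens. The tokenization and the sample addresses are assumptions for illustration; the comment does not say how the vectors were built.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the two token-count vectors."""
    vec_a, vec_b = tokenize(a), tokenize(b)
    # Counter returns 0 for tokens missing from vec_b.
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_sq_a = sum(count * count for count in vec_a.values())
    norm_sq_b = sum(count * count for count in vec_b.values())
    norm = math.sqrt(norm_sq_a * norm_sq_b)
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine_similarity("12 Main St, Springfield", "12 main st. springfield"))
    print(cosine_similarity("12 Main St, Springfield", "99 Oak Ave, Shelbyville"))
```

Token counts ignore word order and punctuation, which suits addresses that differ only in formatting; abbreviations such as "St" versus "Street" would still need normalizing first.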