Ask HN: Open source focused crawler?

6 points by cookerware over 11 years ago

Is there an open source crawler/library that will recursively follow only links under a certain XPath and ignore the rest?

I don't want to do an exhaustive crawl of every single link; I want something that will only follow links under a main content area.
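To make the request concrete, here is a minimal sketch of that behavior using only Python's standard library. It is an illustration, not any particular crawler's implementation. It assumes the main content area can be identified by an element `id` (a stand-in for a full XPath) and that the HTML is well nested. The names `ScopedLinkParser`, `content_links`, `crawl`, and the `fetch` callable are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ScopedLinkParser(HTMLParser):
    """Collect hrefs only from <a> tags nested inside one container element."""

    def __init__(self, container_id, base_url):
        super().__init__()
        self.container_id = container_id  # assumption: container is found by id
        self.base_url = base_url
        self.depth = 0                    # 0 means "outside the container"
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the container when its opening tag is seen.
            if attrs.get("id") == self.container_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        # Assumes well-nested HTML: each closing tag matches an opening tag.
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def content_links(html, container_id, base_url):
    """Return absolute URLs of links found inside the container."""
    parser = ScopedLinkParser(container_id, base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(start_url, container_id, fetch, max_pages=50):
    """Breadth-first crawl that follows only links inside the container.

    `fetch` is a caller-supplied function mapping a URL to an HTML string.
    """
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in content_links(fetch(url), container_id, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

For a real crawl, `fetch` could wrap `urllib.request.urlopen(url).read().decode()`. Links in navigation bars, sidebars, and footers are never queued because they sit outside the container.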

3 comments

sheraz over 11 years ago

I highly recommend Scrapy (http://www.scrapy.org).

From their site:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

techaddict009 over 11 years ago

Check this out: http://commoncrawl.org/

It's not exactly what you are looking for, but it might help you.

forkrulassail over 11 years ago

Have you tried BeautifulSoup?