Ask HN: Open source focused crawler?

6 points by cookerware over 11 years ago

Is there an open source crawler or library that will recursively follow only the links under a certain XPath and ignore the rest?

I don't want to do an exhaustive crawl of every single link; I want something that only follows links inside the main content area.

3 comments

sheraz over 11 years ago

I highly recommend Scrapy (http://www.scrapy.org).

From their site:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

techaddict009 over 11 years ago

Check this out: http://commoncrawl.org/

It's not exactly what you are looking for, but it might help you.

forkrulassail over 11 years ago

Have you tried BeautifulSoup?
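
For reference, here is a rough sketch of the behavior the question asks for, using only the Python standard library so it runs without installing anything. It is not taken from any of the tools mentioned above. It assumes the main content area can be identified by an `id` attribute (here the made-up value `content`) rather than a full XPath, and it assumes reasonably well-formed HTML where non-void tags are closed. The names `ContentLinkParser`, `content_links`, `crawl`, and the `fetch` callable are all hypothetical. With Scrapy, the link extractor's `restrict_xpaths` option would likely be the more direct route.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentLinkParser(HTMLParser):
    """Collect hrefs only from <a> tags nested inside one container element."""

    def __init__(self, base_url, container_id):
        super().__init__()
        self.base_url = base_url
        self.container_id = container_id  # assumed id of the content area
        self.depth = 0                    # 0 means outside the container
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth > 0:
            # Already inside the container: track nesting and keep links.
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(urljoin(self.base_url, attrs["href"]))
        elif attrs.get("id") == self.container_id:
            # Entering the container element itself.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1


def content_links(html, base_url, container_id="content"):
    """Return absolute URLs of links found inside the container only."""
    parser = ContentLinkParser(base_url, container_id)
    parser.feed(html)
    parser.close()
    return parser.links


def crawl(start_url, fetch, container_id="content", max_pages=50):
    """Breadth-first crawl that follows only links inside the container.

    `fetch` is a caller-supplied function mapping a URL to its HTML text.
    """
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in content_links(fetch(url), url, container_id):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch testable offline: a dictionary lookup works for experiments, and a function built on `urllib.request.urlopen` could be swapped in for real pages. A real crawler would also want politeness delays, robots.txt handling, and a domain filter, none of which are shown here.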