Is there an open-source crawler or library that will recursively follow only the links under a certain XPath and ignore the rest?<p>I don't want to do an exhaustive crawl of every single link. I want something that will only follow links inside a main content area.
I highly recommend Scrapy (<a href="http://www.scrapy.org" rel="nofollow">http://www.scrapy.org</a>).<p>From their site:<p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
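For the specific question asked, Scrapy's `LinkExtractor` takes a `restrict_xpaths` argument that limits link extraction to the regions an XPath selects. The underlying idea can be sketched with only the Python standard library. This is a rough illustration, not Scrapy's implementation: `links_under`, `crawl`, the `fetch` callable and the `max_pages` limit are all hypothetical names, and the parser assumes well-formed, namespace-free XHTML (real-world HTML usually needs a more tolerant parser).

```python
import xml.etree.ElementTree as ET
from collections import deque
from urllib.parse import urljoin


def links_under(page_source, base_url, xpath):
    """Return absolute URLs of <a href> elements found beneath `xpath`.

    Assumption: page_source is well-formed XHTML without namespaces,
    and `xpath` uses the XPath subset that ElementTree supports.
    """
    root = ET.fromstring(page_source)
    urls = []
    for region in root.findall(xpath):
        for anchor in region.iter("a"):
            href = anchor.get("href")
            if href:
                urls.append(urljoin(base_url, href))
    return urls


def crawl(start_url, fetch, xpath, max_pages=100):
    """Breadth-first crawl that only follows links inside `xpath`.

    `fetch` is any callable mapping a URL to its page source, so the
    network layer can be swapped out (or faked in tests).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in links_under(fetch(url), url, xpath):
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl(start_url, fetch, ".//div[@id='content']")` would then ignore navigation, footer and sidebar links and follow only those inside the `content` div, assuming the site marks its main area that way.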
Check this out: <a href="http://commoncrawl.org/" rel="nofollow">http://commoncrawl.org/</a><p>It's not exactly what you are looking for, but it might help you.