Ask HN: Open source focused crawler?

6 points by cookerware over 11 years ago
Is there an open source crawler/library that will recursively follow only links under a certain xpath and ignore the rest?

I don't want to do an exhaustive crawl of every single link, I want something that will only follow links under a main content area.
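
For illustration, here is a minimal sketch of the behavior being asked for, using only the Python standard library. It is not taken from any of the tools mentioned in this thread. The standard library has no XPath support for HTML, so the sketch assumes the main content area can be identified by an element id (the id "main", the names ContentLinkParser and focused_crawl, and the example.com pages are all hypothetical). It also assumes reasonably well-formed HTML, since it tracks nesting by counting open and close tags.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0   # nesting depth inside the content element; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the content area when the id matches (assumed marker).
            if attrs.get("id") == self.content_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def focused_crawl(start_url, fetch, content_id="main", max_pages=50):
    """Breadth-first crawl that follows only links found inside the content area.

    `fetch` is any callable mapping a URL to an HTML string or None
    (hypothetical here; in practice it could wrap urllib.request.urlopen).
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = ContentLinkParser(content_id)
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# In-memory stand-in for a website, so the sketch runs without network access.
pages = {
    "http://example.com/": '<div id="nav"><a href="/about">About</a></div>'
                           '<div id="main"><a href="/post/1">Post</a></div>',
    "http://example.com/post/1": '<div id="main"><a href="/post/2">Next</a><br></div>',
    "http://example.com/post/2": '<div id="main">The end</div>',
    "http://example.com/about": '<div id="main">About us</div>',
}

print(focused_crawl("http://example.com/", pages.get))
```

With the sample pages, the /about link in the navigation block is never followed, because it sits outside the element with id "main". For real XPath expressions, Scrapy's LinkExtractor accepts a restrict_xpaths argument that limits link extraction to the matching regions of a page, which is probably the closer fit to the question.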

3 comments

sheraz over 11 years ago
I highly recommend Scrapy (http://www.scrapy.org).

From their site:

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
techaddict009 over 11 years ago
Check this out: http://commoncrawl.org/

It's not exactly what you are looking for, but it might help you.
forkrulassail over 11 years ago
Have you tried BeautifulSoup?