
Ask HN: What do you use for web crawling?

6 points by dennybritz, almost 11 years ago
I'm working on a product that requires a fair amount of web crawling/monitoring and data extraction. I've looked into existing services like 80legs as well as open-source software such as Apache Nutch and Scrapy, but I haven't been able to find anything that fits my purpose. It seems like some startups are building their own solutions (http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/) and I'm currently considering something similar.

Some things I need:

- Regex URL filters / follow rules
- Distributed & dynamically scalable
- Ability to export data into a variety of formats/systems (HDFS, HBase, Elasticsearch, S3) with little overhead
- State maintained across crawls
- Continuous crawl ability, based on URL and change history
- Duplicate detection, perhaps using LSH + MinHash
- Cost-effectiveness

Are there any projects or services out there that I should be aware of?
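On the duplicate-detection point: here is a minimal, standard-library-only sketch of the MinHash half of LSH + MinHash, to show why it fits near-duplicate detection in a crawler. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`) and the parameter choices (5-character shingles, 64 hash functions) are illustrative assumptions, not from any particular library.

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles of a document (illustrative k=5)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(doc, num_hashes=64):
    """MinHash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value seen over the document's shingles.
    blake2b's salt parameter gives us a cheap family of hash functions."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(),
                "big",
            )
            for s in shingles(doc)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates the
    Jaccard similarity of the two documents' shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog")
b = minhash_signature("the quick brown fox jumped over the lazy dog")
c = minhash_signature("completely unrelated page content here")
print(estimated_jaccard(a, b))  # near-duplicate pages: high similarity
print(estimated_jaccard(a, c))  # unrelated pages: low similarity
```

The LSH half would then split each signature into bands (e.g. 16 bands of 4 values) and hash each band into buckets, so only pages colliding in at least one bucket are compared at all; that keeps dedup cost roughly linear in the number of crawled pages rather than quadratic.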

no comments