I'm working on a product that requires a fair amount of web crawling/monitoring and data extraction. I've looked into existing services like 80legs as well as open source software such as Apache Nutch and Scrapy, but I haven't been able to find anything that fits my needs. It seems like some startups are building their own solutions (http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/) and I'm currently considering something similar.

Some things I need:

- Regex URL filters / follow rules
- Distributed & dynamically scalable
- Ability to export data into a variety of formats/systems (HDFS, HBase, Elasticsearch, S3) with little overhead
- State maintained across crawls
- Continuous crawl ability, based on URL and change history
- Duplicate detection, perhaps using LSH + MinHash (rough sketch of what I have in mind below)
- Cost-effectiveness

Are there any projects or services out there that I should be aware of?
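For the duplicate-detection point, this is roughly the MinHash + LSH banding approach I have in mind. It's a self-contained toy sketch, not code from any existing library; the names (shingles, minhash, LSHIndex) and the parameter choices (NUM_HASHES, BANDS, SHINGLE_SIZE) are all illustrative assumptions:

    # Rough sketch of near-duplicate detection with MinHash + LSH banding.
    # All names and parameters are illustrative, not from a specific library.
    import hashlib
    import re
    from collections import defaultdict

    NUM_HASHES = 128   # number of MinHash "permutations" (assumed value)
    BANDS = 32         # LSH bands; rows per band = NUM_HASHES // BANDS
    SHINGLE_SIZE = 5   # word shingles per document (assumed value)

    def shingles(text, k=SHINGLE_SIZE):
        # Break a page's text into overlapping k-word shingles.
        words = re.findall(r"\w+", text.lower())
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def minhash(shingle_set):
        # One min-hash value per "permutation", simulated by salting the hash input.
        sig = []
        for seed in range(NUM_HASHES):
            sig.append(min(
                int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set
            ))
        return sig

    class LSHIndex:
        def __init__(self):
            # (band index, band signature) -> set of doc ids
            self.buckets = defaultdict(set)

        def add(self, doc_id, sig):
            rows = NUM_HASHES // BANDS
            for b in range(BANDS):
                band = tuple(sig[b * rows:(b + 1) * rows])
                self.buckets[(b, band)].add(doc_id)

        def candidates(self, sig):
            # Any document sharing at least one full band is a near-duplicate candidate.
            rows = NUM_HASHES // BANDS
            out = set()
            for b in range(BANDS):
                band = tuple(sig[b * rows:(b + 1) * rows])
                out |= self.buckets.get((b, band), set())
            return out

    # Usage: index each fetched page, and skip storing/refetching pages whose
    # signature collides with an existing one in some band.
    index = LSHIndex()
    sig_a = minhash(shingles("the quick brown fox jumps over the lazy dog today"))
    index.add("page-a", sig_a)
    sig_b = minhash(shingles("the quick brown fox jumps over the lazy dog yesterday"))
    print(index.candidates(sig_b))   # likely {'page-a'} -> near-duplicate candidate

The idea is just to avoid pairwise comparisons: the per-band buckets give candidate pairs cheaply, and exact Jaccard similarity only needs to be checked for those candidates. I'd obviously expect a real crawler integration to persist the buckets somewhere shared rather than in memory.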