I'm working on a product that requires a fair amount of web crawling/monitoring and data extraction. I've looked into existing services like 80legs as well as open source software such as Apache Nutch and Scrapy, but I haven't been able to find anything that fits my needs. It seems like some startups are building their own solutions (http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/) and I'm currently considering something similar.

Some things I need:

- Regex URL filters / follow rules
- Distributed & dynamically scalable
- Ability to export data into a variety of formats/systems (HDFS, HBase, Elasticsearch, S3) with little overhead
- State maintained across crawls
- Continuous crawl ability, based on URL and change history
- Duplicate detection, perhaps using LSH + MinHash (rough sketch of what I have in mind below)
- Cost-effectiveness

Are there any projects or services out there that I should be aware of?
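For the duplicate-detection point, this is roughly the MinHash + LSH banding approach I have in mind. It's a self-contained toy sketch, not code from any existing library; the names (shingles, minhash, LSHIndex) and the parameter choices (NUM_HASHES, BANDS, SHINGLE_SIZE) are all illustrative assumptions:

    # Rough sketch of near-duplicate detection with MinHash + LSH banding.
    # All names and parameters are illustrative, not from a specific library.
    import hashlib
    import re
    from collections import defaultdict

    NUM_HASHES = 128   # number of MinHash "permutations" (assumed value)
    BANDS = 32         # LSH bands; rows per band = NUM_HASHES // BANDS
    SHINGLE_SIZE = 5   # word shingles per document (assumed value)

    def shingles(text, k=SHINGLE_SIZE):
        # Break a page's text into overlapping k-word shingles.
        words = re.findall(r"\w+", text.lower())
        return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def minhash(shingle_set):
        # One min-hash value per "permutation", simulated by salting the hash input.
        sig = []
        for seed in range(NUM_HASHES):
            sig.append(min(
                int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingle_set
            ))
        return sig

    class LSHIndex:
        def __init__(self):
            # (band index, band signature) -> set of doc ids
            self.buckets = defaultdict(set)

        def add(self, doc_id, sig):
            rows = NUM_HASHES // BANDS
            for b in range(BANDS):
                band = tuple(sig[b * rows:(b + 1) * rows])
                self.buckets[(b, band)].add(doc_id)

        def candidates(self, sig):
            # Any document sharing at least one full band is a near-duplicate candidate.
            rows = NUM_HASHES // BANDS
            out = set()
            for b in range(BANDS):
                band = tuple(sig[b * rows:(b + 1) * rows])
                out |= self.buckets.get((b, band), set())
            return out

    # Usage: index each fetched page, and skip storing/refetching pages whose
    # signature collides with an existing one in some band.
    index = LSHIndex()
    sig_a = minhash(shingles("the quick brown fox jumps over the lazy dog today"))
    index.add("page-a", sig_a)
    sig_b = minhash(shingles("the quick brown fox jumps over the lazy dog yesterday"))
    print(index.candidates(sig_b))   # likely {'page-a'} -> near-duplicate candidate

The idea is just to avoid pairwise comparisons: the per-band buckets give candidate pairs cheaply, and exact Jaccard similarity only needs to be checked for those candidates. I'd obviously expect a real crawler integration to persist the buckets somewhere shared rather than in memory.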