In the past we built and operated Greece's largest search engine (Trinity), and we crawled/refreshed all Greek pages fairly regularly.

If memory serves, the refresh frequency was computed for clusters of pages from the same site. It depended on how often the pages were actually updated (news sites' front pages differed between successive fetches, whereas users' homepages rarely changed), and on how resilient the site was to aggressive indexing: if requests failed, timed out, or took longer to download than the site-wide aggregated metrics predicted, we'd dial the frequency back, and so on.

The crawlers all drained multiple queues, but URLs from the same site always ended up on the same queue (via consistent hashing of the hostname), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any single site, with no need for cross-crawler state synchronisation.

In practice this worked quite well. Also, this was before Google and its PageRank and social networks; we'd probably have also factored page popularity (PageRank-like metrics and social 'signals') into the frequency computation, among other variables.
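A minimal sketch of that queue assignment, assuming nothing about the original Trinity code: hostnames are placed on a consistent-hash ring, so every URL of a site maps to the same queue and therefore to the same crawler process, which can then own that site's throttling and robots.txt handling on its own. The queue names and replica count are illustrative.

```python
import bisect
import hashlib
from urllib.parse import urlsplit

class QueueRing:
    """Consistent-hash ring mapping hostnames to crawler queues."""

    def __init__(self, queue_ids, replicas=64):
        # Place each queue on the ring several times ("virtual nodes")
        # so load stays reasonably balanced across queues.
        self._ring = sorted(
            (self._hash(f"{qid}:{i}"), qid)
            for qid in queue_ids
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def queue_for(self, url):
        # All URLs of a host hash to the same point, hence the same queue,
        # so one crawler process owns that site's rate limiting and robots.txt.
        host = urlsplit(url).hostname or ""
        idx = bisect.bisect(self._keys, self._hash(host)) % len(self._keys)
        return self._ring[idx][1]

ring = QueueRing([f"queue-{n}" for n in range(8)])
# Same hostname, so both URLs land on the same queue and the same crawler.
assert ring.queue_for("http://example.gr/") == ring.queue_for("http://example.gr/news/today")
```

The virtual nodes are what make the ring consistent in the useful sense: adding or removing a queue only moves the sites that hashed next to it, rather than reshuffling every site to a new crawler.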
In my experience the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection against an aggressive crawler; if you make 50 to 100 requests per second, chances are you're DDoS-ing the shit out of most sites.

As for robots.txt, the problem is that most sites don't even have one, especially e-commerce sites. They also don't have a sitemap.xml, in case you don't want to hit every URL just to discover the structure of the site. Being polite in many cases takes considerable effort.
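A minimal sketch of that approach, with an illustrative user-agent string and delay (neither taken from the comment above): one request at a time, a fixed pause between requests, and robots.txt consulted when the site actually has one.

```python
import time
import urllib.robotparser
import urllib.request

USER_AGENT = "polite-crawler-example"   # hypothetical bot name
DELAY_SECONDS = 2.0                     # one request every 2s, not 50-100/s

def fetch_politely(urls, robots_url):
    # Many small sites have no robots.txt at all; treat a failed fetch
    # of it as "no rules" rather than giving up.
    rp = urllib.robotparser.RobotFileParser(robots_url)
    try:
        rp.read()
    except OSError:
        rp = None

    pages = {}
    for url in urls:                    # strictly sequential, never concurrent
        if rp and not rp.can_fetch(USER_AGENT, url):
            continue                    # skip paths the site disallows
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req, timeout=10) as resp:
            pages[url] = resp.read()
        time.sleep(DELAY_SECONDS)       # pause between requests
    return pages
```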
See also Tuesday's HN discussion on the ethics of data scraping (https://news.ycombinator.com/item?id=12345952), in which Hacker News is completely split on whether data scraping is ethical even when the Terms of Service explicitly forbid it.
Reading the previous thread again, I suppose that many of those against scraping didn't realize they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Instead of fighting scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.

You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.
I worked on a research project to develop a web-scale "google" for scientific data, and we found very interesting things in robots.txt files, from "don't crawl us" to "crawl one page every other day" or, even better, "don't crawl unless you're Google".

Another thing we noticed is that Google's crawler is kind of aggressive; I guess they are in a position to do that.

Our paper in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=7363976)
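A minimal sketch of what those policies look like and how a crawler reads them, using Python's standard robotparser; the file contents, bot names, and URLs are illustrative, not taken from the paper.

```python
import urllib.robotparser

# Hypothetical robots.txt in the spirit of the ones described above:
# everything is open to Googlebot, everyone else is shut out and told
# to wait two days between requests.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
Crawl-delay: 172800
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "http://example.org/dataset/1"))     # True
print(rp.can_fetch("research-bot", "http://example.org/dataset/1"))  # False
print(rp.crawl_delay("research-bot"))                                # 172800 seconds
```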
The current protocols promote data exchange, and since websites are primarily designed to be consumed, there is really no way to stop automated requests. Even companies like Distil Networks[1], which parse in-flight requests, have trouble stopping any sufficiently motivated outfit.

I think data should be disseminated, and free info exchange is great. If possible, devs should respect website owners as much as possible, although in my experience people seem more willing to rip off large "faceless" sites than mom-and-pops: both because that is where the valuable data is, and because it seems more justifiable, even if morally gray.

Regardless, the thing I find most interesting is that Google is most often criticized for selling user data / selling out their users' privacy. However, it is often not mentioned that Googlebot and the army of Chrome browsers are not only permitted but encouraged to crawl all sites except the scant few that have achieved escape velocity. Sites that wish to protect their data must disallow and forcibly stop most crawlers except Google, otherwise they will be unranked. This creates an odd dichotomy where not only does Google retain massive leverage, but any other search engine or aggregator has more hurdles and fewer resources to compete.

[1] They protect Crunchbase and many media companies.
If you're worried that your web scraper is being a pain in the ass to administrators, they probably need to rethink the way their website is set up.
An alternative to Scrapinghub: PhantomJsCloud.com

It's a bit more "raw" than Scrapinghub, but full-featured and cheap.

Disclaimer: I'm the author!