
How to Crawl the Web Politely with Scrapy

139 points by stummjr over 8 years ago

8 comments

markpapadakis over 8 years ago
In the past we built and operated Greece's largest search engine (Trinity), and we would crawl/refresh all Greek pages fairly regularly.

If memory serves, the refresh frequency was computed for clusters of pages from the same site, and it depended on how often they were updated (news-site front pages were in practice different between successive updates, whereas e.g. user homepages rarely changed) and on how resilient the sites were to aggressive indexing (if they'd fail or time out, or if downloading the page contents took longer than expected based on site-wide aggregated metrics, we'd adjust the frequency, etc.).

The crawlers all drained multiple queues, and URLs from the same site would always end up on the same queue (via consistent hashing, based on the hostname's hash), so a single crawler process was responsible for throttling requests and respecting robots.txt rules for any single site, with no need for cross-crawler state synchronisation.

In practice this worked quite well. Also, this was before Google and its PageRank and before social networks (otherwise we'd probably also have factored page popularity into the frequency computation, using PageRank-like metrics and social 'signals', among other variables).
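
A minimal Python sketch of that routing idea (plain hash-modulo partitioning rather than true consistent hashing; the crawler count and URLs are made-up placeholders):

    import hashlib
    from urllib.parse import urlparse

    NUM_CRAWLERS = 8  # hypothetical number of crawler processes / queues

    def queue_for(url):
        """Route a URL to a queue by hashing its hostname, so every URL from
        one site lands on the same crawler and per-site throttling plus
        robots.txt handling need no cross-crawler coordination."""
        host = urlparse(url).hostname or ""
        digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_CRAWLERS

    queues = [[] for _ in range(NUM_CRAWLERS)]
    for url in ("https://example.gr/news", "https://example.gr/sports", "https://other.example.gr/"):
        queues[queue_for(url)].append(url)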

elorant over 8 years ago
In my experience the best way to crawl politely is to never use an asynchronous crawler. The vast majority of small to medium sites out there have absolutely no protection against an aggressive crawler. If you make 50 to 100 requests per second, chances are you're DDoS-ing the shit out of most sites.

As for robots.txt, the problem is that most sites don't even have one. Especially e-commerce sites. They also don't have a sitemap.xml, in case you don't want to hit every URL just to discover the structure of the site. Being polite in many cases takes considerable effort.
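
For Scrapy specifically (the tool the article covers), a conservative baseline could look like the sketch below; the values are illustrative and the bot name and contact URL are placeholders to be tuned per project:

    # settings.py -- an illustrative politeness baseline for a Scrapy project
    ROBOTSTXT_OBEY = True                # honour robots.txt where it exists
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # no parallel requests to the same site
    DOWNLOAD_DELAY = 2.0                 # wait ~2 seconds between requests to a domain
    AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
    AUTOTHROTTLE_START_DELAY = 2.0
    AUTOTHROTTLE_MAX_DELAY = 30.0
    USER_AGENT = "examplebot/1.0 (+https://example.com/bot-info)"  # identify yourself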

minimaxir over 8 years ago
See also Tuesday's HN discussion on the ethics of data scraping (https://news.ycombinator.com/item?id=12345952), in which Hacker News is *completely split* on whether data scraping is ethical even if the Terms of Service *explicitly forbids it*.

tangue over 8 years ago
Reading the previous thread again, I suppose that many of those against scraping haven't realized they've already lost: with Ghost, Phantom, and now headless Chrome, you're going to have a hard time detecting a well-built scraper.

Instead of fighting against scrapers that don't want to harm you, maybe it's about time to invest in your robots.txt and cooperate.

You could say that scraping your website is FORBIDDEN, but come on: if Airbnb can rent houses, I can scrape your site.

betolink over 8 years ago
I worked on a research project to develop a web-scale "google" for scientific data, and we found very interesting things in robots.txt files, from "don't crawl us" to "crawl 1 page every other day" or, even better, "don't crawl unless you're google".

Another thing we noticed is that google's crawler is kind of aggressive; I guess they are in a position to do it.

Our paper, in case someone is interested: Optimizing Apache Nutch for domain specific crawling at large scale (http://ieeexplore.ieee.org/document/7363976/?arnumber=7363976)
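
Hypothetical robots.txt snippets in the spirit of those patterns (the "every other day" rate is approximated with the non-standard Crawl-delay directive, which not every crawler honours):

    # "don't crawl us"
    User-agent: *
    Disallow: /

    # "crawl roughly one page every other day" (Crawl-delay is in seconds and non-standard)
    User-agent: *
    Crawl-delay: 172800

    # "don't crawl unless you're Google"
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /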

vonklaus over 8 years ago
The current protocols promote data exchange, and since websites are primarily designed to be consumed, there is really no way to stop automated requests. Even companies like distilli[1] networks that parse in-flight requests have trouble stopping any sufficiently motivated outfit.

I think data should be disseminated, and free info exchange is great. If possible, devs should respect website owners as much as possible; although in my experience people seem more willing to rip off large "faceless" sites than mom-and-pops, both because that is where the valuable data is and because it seems more justifiable, even if morally gray.

Regardless, the thing I find most interesting is that Google is most often criticized for selling user data / selling out their users' privacy. However, it is rarely mentioned that Googlebot and the army of Chrome browsers are not only permitted but encouraged to crawl all sites except a scant few that have achieved escape velocity. Sites that wish to protect their data must disallow and forcibly stop most crawlers except Google's, otherwise they will be unranked. This creates an odd dichotomy where not only does Google retain massive leverage, but another search engine or aggregator has more hurdles and fewer resources to compete.

[1] They protect Crunchbase and many media companies.

libeclipse over 8 years ago
If you're worried about your web scraper being a pain in the ass to administrators, they probably need to rethink the way they have their website set up.

novaleaf over 8 years ago
An alternative to Scrapinghub: PhantomJsCloud.com

It's a bit more "raw" than Scrapinghub, but full featured and cheap.

Disclaimer: I'm the author!