An Introduction to Compassionate Screen Scraping

97 points by helwr about 14 years ago

6 comments

jp about 14 years ago
Pretending to be human is problematic if the server thinks you are a robot because of User-Agent, IP subnet (dynamic IP cloud systems) and DNS look-up patterns (CNN and similar sites).

So "behaving like a human" on HN might result in an IP ban because /x is denied in robots.txt. And this gets really funny when you get banned randomly because of dynamic IP addresses in cloud infrastructure.
hung about 14 years ago
Caching is nice, but HTTP has a built-in method: conditional GETs. I wrote up a blog post on how to do this with App Engine but it should work generally in Python using urllib2.

http://www.hung-truong.com/blog/2010/12/01/conditional-gets-in-app-engine/
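
For readers who want to see what a conditional GET looks like in code, here is a minimal sketch. It is not taken from the linked post: it assumes Python 3, where `urllib2` became `urllib.request`, and the helper names `build_conditional_request` and `conditional_get` as well as the shape of the `cached` dict are made up for illustration. The idea is to send back the `ETag` and `Last-Modified` values from the previous response so the server can answer `304 Not Modified` instead of resending the page.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whatever cache validators we saved earlier."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def conditional_get(url, cached=None):
    """Fetch url, reusing `cached` when the server replies 304 Not Modified.

    `cached` is a hypothetical dict with the keys "body", "etag" and
    "last_modified", as returned by a previous call to this function.
    """
    cached = cached or {}
    req = build_conditional_request(
        url, cached.get("etag"), cached.get("last_modified")
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return {
                "body": resp.read(),
                "etag": resp.headers.get("ETag"),
                "last_modified": resp.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "cache still valid".
        if err.code == 304 and cached:
            return cached
        raise
```

A scraper would store the returned dict per URL and pass it back in on the next visit. This only helps when the server actually sends validators; for pages without `ETag` or `Last-Modified` headers, every request is a full download.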
runningdogx about 14 years ago
Screen scraping is taking visual data and transforming it into structured data. A screen scraper would graphically capture a window and try to identify or pick out data. Bots for MMOs tend to do that, along with providing input to the MMO depending on what they "see".

Web or data scraping is what the article talks about. Still a hard problem, easily broken by minor changes to the scraped webpage, but not subject to the vagaries of OCR and computer vision or graphical interpretation problems, which is what I was expecting from the title.
eli about 14 years ago
No mention of observing robots.txt?
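
As a rough sketch of what observing robots.txt can look like, Python's standard library ships `urllib.robotparser`. The helper name `make_checker`, the user agent string, and the sample rules below are illustrative assumptions, not the contents of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def make_checker(robots_txt, user_agent):
    """Return a function that says whether user_agent may fetch a given URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Illustrative rules only: everything under /x is off limits to all agents.
SAMPLE_RULES = "User-agent: *\nDisallow: /x\n"

allowed = make_checker(SAMPLE_RULES, "example-scraper")
print(allowed("http://example.com/news"))
print(allowed("http://example.com/x?fnid=abc"))
```

Against a live site, `set_url()` followed by `read()` on the parser downloads the real file instead of parsing a string. The check belongs in front of every request, not just the first one.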
storborg about 14 years ago
The author makes some great suggestions, namely to cache heavily and throttle requests. However, they lost a lot of credibility for me with "screen scraper traffic should be indistinguishable from human traffic". Sorry, but that's BS--socially responsible scraping leaves control with the publisher. If the publisher doesn't want you scraping their content, you shouldn't try to fake a human in order to be able to.
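
The two suggestions praised above, caching and throttling, fit in a few lines. This is a minimal sketch under stated assumptions rather than the article's own code: the class names `Throttle` and `CachedFetcher` are hypothetical, the cache is a plain in-memory dict that never expires, and `fetch` stands for whatever function actually downloads a URL.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the delay logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


class CachedFetcher:
    """Serve repeat URLs from memory; throttle only real network fetches."""

    def __init__(self, fetch, throttle):
        self._fetch = fetch
        self._throttle = throttle
        self._cache = {}

    def get(self, url):
        if url not in self._cache:
            self._throttle.wait()
            self._cache[url] = self._fetch(url)
        return self._cache[url]
```

Usage would look like `CachedFetcher(download, Throttle(2.0)).get(url)`, where `download` is your own function. Cache hits skip the throttle entirely, so only requests that reach the publisher's server are slowed down.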
dhruvbird about 14 years ago
I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You can get the complete power of jQuery with jsdom, making screen scraping child's play.