TechEcho

6 comments

jp, about 14 years ago

Pretending to be human is problematic if the server thinks you are a robot because of User-Agent, IP subnet (dynamic IP cloud systems) and DNS look-up patterns (CNN and similar sites).

So "behaving like a human" on HN might result in an IP ban because /x is denied in robots.txt. And this gets really funny when you get banned randomly because of dynamic IP addresses in cloud infrastructure.

hung, about 14 years ago

Caching is nice, but HTTP has a built-in method: conditional GETs. I wrote up a blog post on how to do this with App Engine but it should work generally in Python using urllib2.

http://www.hung-truong.com/blog/2010/12/01/conditional-gets-in-app-engine/
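For anyone who wants to see the shape of a conditional GET without reading the post, here is a minimal sketch. It is not the code from the linked blog post: it assumes Python 3's `urllib.request` rather than urllib2, and the function names `build_conditional_request` and `conditional_get` are made up for illustration. The idea is to send back the `ETag` and `Last-Modified` values saved from the previous response and treat a 304 as "use your cached copy".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying validators saved from an earlier response."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def conditional_get(url, etag=None, last_modified=None,
                    opener=urllib.request.urlopen):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with opener(req) as resp:
            # Fresh content: hand back the new validators so the caller can store them.
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Nothing changed: the caller keeps using its cached body.
            return None, etag, last_modified
        raise
```

The caller stores the returned validators next to the cached body and passes them in on the next fetch; the `opener` parameter is only there so the function can be exercised without touching the network.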

runningdogx, about 14 years ago

Screen scraping is taking visual data and transforming it into structured data. A screen scraper would graphically capture a window and try to identify or pick out data. Bots for MMOs tend to do that, along with providing input to the MMO depending on what they "see".

Web or data scraping is what the article talks about. Still a hard problem, easily broken by minor changes to the scraped webpage, but not subject to the vagaries of OCR and computer vision or graphical interpretation problems, which is what I was expecting from the title.

eli, about 14 years ago

No mention of observing robots.txt?
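Checking robots.txt before fetching takes only a few lines. A rough sketch, assuming Python 3's standard `urllib.robotparser`; the helper name `allowed`, the `examplebot` user agent, and the sample rules are hypothetical (the `Disallow: /x` line is modeled on the rule mentioned elsewhere in this thread, not copied from a real file).

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only.
SAMPLE_RULES = "User-agent: *\nDisallow: /x\n"

if __name__ == "__main__":
    print(allowed(SAMPLE_RULES, "examplebot", "http://example.com/x"))
    print(allowed(SAMPLE_RULES, "examplebot", "http://example.com/news"))
```

In a real scraper you would download the site's robots.txt once (for instance with `RobotFileParser.set_url` and `read`), cache the parsed result, and skip any URL for which `can_fetch` returns False.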

storborg, about 14 years ago

The author makes some great suggestions, namely to cache heavily and throttle requests. However, they lost a lot of credibility for me with "screen scraper traffic should be indistinguishable from human traffic". Sorry, but that's BS--socially responsible scraping leaves control with the publisher. If the publisher doesn't want you scraping their content, you shouldn't try to fake a human in order to be able to.
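The throttling half of that advice can be sketched as a small rate limiter that enforces a minimum gap between requests. This is an illustration under assumptions, not the article's implementation: the class name `Throttle` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval, in seconds, between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each request to a given host; one `Throttle` per host keeps a slow site from holding up requests to the others.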

dhruvbird, about 14 years ago

I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You can get the complete power of jQuery with jsdom, making screen scraping child's play.

An Introduction to Compassionate Screen Scraping