Pretending to be human is problematic when the server can still tell you're a robot from your User-Agent, your IP subnet (dynamic-IP cloud systems), or your DNS look-up patterns (CNN and similar sites do this).<p>So "behaving like a human" on HN might earn you an IP ban, because /x is denied in robots.txt. And it gets really funny when you get banned at random because of dynamic IP addresses in cloud infrastructure.
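A scraper can at least avoid the robots.txt trap the parent describes by checking the rules before fetching. A minimal sketch with Python's stdlib `urllib.robotparser` — the rules and the "MyScraper" agent name below are illustrative, not HN's actual robots.txt:

```python
from urllib import robotparser

# Illustrative rules in the spirit of HN's robots.txt (which denies /x);
# a real scraper would fetch the live file with rp.set_url(...) + rp.read().
rules = """\
User-agent: *
Disallow: /x
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths under /x are off limits; ordinary item pages are fine.
print(rp.can_fetch("MyScraper", "/x?fnid=abc"))   # disallowed
print(rp.can_fetch("MyScraper", "/item?id=1"))    # allowed
```

Checking this up front means the ban never happens in the first place, dynamic IPs or not.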
Caching is nice, but HTTP has a built-in mechanism for it: conditional GETs. I wrote up a blog post on how to do this with App Engine, but it should work generally in Python using urllib2.<p><a href="http://www.hung-truong.com/blog/2010/12/01/conditional-gets-in-app-engine/" rel="nofollow">http://www.hung-truong.com/blog/2010/12/01/conditional-gets-...</a>
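For anyone who hasn't seen the pattern: you resend the validators (`ETag` / `Last-Modified`) from the previous response, and the server answers 304 Not Modified with an empty body if nothing changed. A rough sketch using `urllib.request` (urllib2's Python 3 successor); the cache layout here is my own, not from the linked post:

```python
import urllib.request
import urllib.error

def build_conditional_request(url, cached):
    """Build a Request carrying validators from a prior response.
    `cached` is (etag, last_modified, body) or None."""
    req = urllib.request.Request(url)
    if cached:
        etag, last_modified, _ = cached
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
    return req

def conditional_get(url, cache):
    """Fetch url; on 304 Not Modified, reuse the cached body."""
    req = build_conditional_request(url, cache.get(url))
    try:
        resp = urllib.request.urlopen(req)
        body = resp.read()
        cache[url] = (resp.headers.get("ETag"),
                      resp.headers.get("Last-Modified"), body)
        return body
    except urllib.error.HTTPError as e:
        if e.code == 304:   # our cached copy is still fresh
            return cache[url][2]
        raise
```

The 304 response costs the server almost nothing to produce, which is the whole point.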
Screen scraping is taking visual data and transforming it into structured data. A screen scraper would graphically capture a window and try to identify or pick out data. Bots for MMOs tend to do that, along with providing input to the MMO depending on what they "see".<p>Web or data scraping is what the article talks about. Still a hard problem, easily broken by minor changes to the scraped webpage, but not subject to the vagaries of OCR and computer vision or graphical interpretation problems, which is what I was expecting from the title.
The author makes some great suggestions, namely to cache heavily and throttle requests. However, they lost a lot of credibility for me with "screen scraper traffic should be indistinguishable from human traffic". Sorry, but that's BS--socially responsible scraping leaves control with the publisher. If the publisher doesn't want you scraping their content, you shouldn't fake being human in order to do it anyway.
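The two good suggestions combine naturally: a cached page costs the publisher nothing, and anything not cached gets rate-limited. A minimal sketch (class and parameter names are mine, not the article's; `fetch` is any callable taking a URL and returning a body):

```python
import time

class ThrottledFetcher:
    """Cache responses and space real requests out by a minimum interval."""

    def __init__(self, fetch, min_interval=2.0):
        self.fetch = fetch                # callable: url -> body
        self.min_interval = min_interval  # seconds between real requests
        self.cache = {}
        self._last = 0.0

    def get(self, url):
        if url in self.cache:             # cache hit: no request at all
            return self.cache[url]
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)              # throttle: honor the gap
        body = self.fetch(url)
        self._last = time.monotonic()
        self.cache[url] = body
        return body
```

Note this is deliberately the opposite of "indistinguishable from human traffic": a steady, honest request rate from one identifiable agent.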
I can't help but mention that you should probably be using node.js with the jsdom module for such a task these days. You get the complete power of jQuery with jsdom, which makes screen scraping child's play.