For really quick one-off scraping, httplib2+lxml+PyQuery is a pretty neat combination:<p><pre><code> import httplib2, lxml.etree, pyquery

 h = httplib2.Http(".cache")  # cache responses on disk in ./.cache

 def get(url):
     # serve a cached copy if it's less than an hour old
     resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
     return pyquery.PyQuery(lxml.etree.HTML(content))
 </code></pre>
This gives you a little function that fetches any URL and hands it back as a jQuery-like object:<p><pre><code> pq = get("http://foo.com/bar")
checkboxes = pq('form input[type=checkbox]')
nextpage = pq('a.next').attr('href')
</code></pre>
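The matched sets support the usual jQuery-style accessors for pulling data back out. A quick sketch (the selectors and attribute names here are hypothetical):<p><pre><code> # .items() yields each matched element as its own PyQuery object
 for box in checkboxes.items():
     print(box.attr('name'), box.attr('value'))

 # .text() collects the text content of a selection
 title = pq('h1').text()
 </code></pre>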
And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate, e.g. in a pagination loop like the sketch below.
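A minimal sketch, assuming the get() helper above and a hypothetical listing page whose next-page links are absolute URLs:<p><pre><code> url = "http://foo.com/bar"
 while url:
     pq = get(url)  # served straight from .cache on repeat runs
     for row in pq('table.listing tr').items():
         print(row.text())
     # .attr() returns None when nothing matches, which ends the loop
     url = pq('a.next').attr('href')
 </code></pre><p>Just something else to throw in the toolbelt ...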
Also check this out for a pretty good discussion on scraping <a href="http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data" rel="nofollow">http://pyvideo.org/video/609/web-scraping-reliably-and-effic...</a>
Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone[1]. Gist: <a href="https://gist.github.com/2475824" rel="nofollow">https://gist.github.com/2475824</a>. Screenshot: <a href="http://i.imgur.com/cbv9A.png" rel="nofollow">http://i.imgur.com/cbv9A.png</a><p>[1]: <a href="http://anemone.rubyforge.org/doc/index.html" rel="nofollow">http://anemone.rubyforge.org/doc/index.html</a>