TechEcho

5 comments

JackC about 13 years ago

For really quick one-off scraping, httplib2 + lxml + PyQuery is a pretty neat combination:
<pre><code>import httplib2
import lxml.etree
import pyquery

# Keep an on-disk HTTP cache in the ".cache" directory.
h = httplib2.Http(".cache")

def get(url):
    # Fetch the page (httplib2 may answer from its cache) and parse it.
    resp, content = h.request(url, headers={'cache-control': 'max-age=3600'})
    return pyquery.PyQuery(lxml.etree.HTML(content))
</code></pre>
This gives you a little function that fetches any URL as a jQuery-like object:
<pre><code>pq = get("http://foo.com/bar")
checkboxes = pq('form input[type=checkbox]')
nextpage = pq('a.next').attr('href')
</code></pre>
And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate.

Just something else to throw in the toolbelt ...
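For what it's worth, the `nextpage` lookup above lends itself to a small pagination loop. Here is a rough sketch, not anything from the original comment: `follow_next_links` and `max_pages` are made-up names, and it assumes the site marks its "next" link with `a.next` as in the example. The fetcher is passed in as `get`, so any callable that returns a PyQuery-like object will do.

```python
from urllib.parse import urljoin


def follow_next_links(get, start_url, max_pages=10):
    """Yield (url, page) pairs, following 'a.next' links from start_url.

    `get` is any callable that takes a URL and returns a PyQuery-like
    object (hypothetical helper; the selector 'a.next' is an assumption
    about the target site's markup).
    """
    url = start_url
    seen = set()
    # Stop on a missing link, a repeated URL, or after max_pages pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = get(url)
        yield url, page
        href = page('a.next').attr('href')
        # Resolve relative links against the page they were found on.
        url = urljoin(url, href) if href else None
```

Used with the cached `get` from the comment, something like `for url, pq in follow_next_links(get, "http://foo.com/bar"): ...` would walk the listing page by page, and re-running it while you tweak selectors should mostly hit the cache.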


jat1 about 13 years ago

Also check this out for a pretty good discussion on scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data


danneu about 13 years ago

Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone [1]. Gist: https://gist.github.com/2475824. Screenshot: http://i.imgur.com/cbv9A.png

[1]: http://anemone.rubyforge.org/doc/index.html

ananthrk about 13 years ago

Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and not as "isbullshit_spiders.py"? :)


hack_edu about 13 years ago

I really want to read this. The topic is right up my alley. Unfortunately, the page is literally broken and unreadable on Android ICS with Chrome :(


Crawl a website with scrapy and store extracted results with MongoDB
