Crawl a website with scrapy and store extracted results with MongoDB

93 points by BaltoRouberol about 13 years ago

5 comments

JackC about 13 years ago
For really quick one-off scraping, httplib2+lxml+PyQuery is a pretty neat combination:

<pre><code>
import httplib2
import lxml.etree  # import the submodule explicitly; a bare "import lxml" does not load etree
import pyquery

# Responses are cached on disk in the ".cache" directory.
h = httplib2.Http(".cache")

def get(url):
    # Accept cached copies that are up to one hour old.
    resp, content = h.request(
        url, headers={'cache-control': 'max-age=3600'})
    return pyquery.PyQuery(lxml.etree.HTML(content))
</code></pre>

This gives you a little function that fetches any URL as a jQuery-like object:

<pre><code>
pq = get("http://foo.com/bar")
checkboxes = pq('form input[type=checkbox]')
nextpage = pq('a.next').attr('href')
</code></pre>

And of course all of the requests are cached using whatever cache headers you want, so repeated requests will load instantly as you iterate.

Just something else to throw in the toolbelt ...
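
As a rough sketch of how the `nextpage` lookup could drive a small crawl loop: the `crawl` helper and its `max_pages` limit below are hypothetical names, not part of httplib2, lxml, or PyQuery, and the `a.next` selector is just the one from the example above. The fetcher is passed in as `get`, so any function that returns a PyQuery-like object for a URL should work.

```python
from urllib.parse import urljoin


def crawl(get, start_url, max_pages=10):
    """Yield (url, page) pairs, following 'a.next' links from start_url.

    `get` is any callable that takes a URL and returns a PyQuery-like
    object (hypothetical interface: page(selector).attr(name)).
    """
    url = start_url
    seen = set()
    # Stop on a missing link, a repeated URL, or the page limit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = get(url)
        yield url, page
        href = page('a.next').attr('href')
        # Resolve relative links against the page they came from.
        url = urljoin(url, href) if href else None
```

Because the fetcher caches responses, re-running a loop like this while tweaking selectors should not hit the site again until the cached copies expire.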
jat1 about 13 years ago
Also check this out for a pretty good discussion on scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data
danneu about 13 years ago
Here's the same functionality written in Ruby using Chris Kite's crawler called Anemone[1]. Gist: https://gist.github.com/2475824. Screenshot: http://i.imgur.com/cbv9A.png

[1]: http://anemone.rubyforge.org/doc/index.html
ananthrk about 13 years ago
Cool. BTW, is there a reason for naming the file "isullshit_spiders.py" and not "isbullshit_spiders.py"? :)
hack_edu about 13 years ago
I really want to read this. The topic is right up my alley.

Unfortunately, the page is literally broken and unreadable on Android ICS with Chrome :(