Why not just use Scrapy[1]? It's built for this sort of thing, easily extensible, and written in Python.<p>[1] <a href="http://scrapy.org/" rel="nofollow">http://scrapy.org/</a>
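For anyone who hasn't tried it, a minimal Scrapy spider sketch looks roughly like this (the URL, spider name, and selectors below are placeholders, not anything from the article):

    import scrapy

    class PostSpider(scrapy.Spider):
        name = "posts"
        start_urls = ["http://example.com/posts"]  # placeholder URL

        def parse(self, response):
            # one XPath per field; Scrapy handles scheduling, retries and throttling
            for href in response.xpath('//a[@class="post"]/@href').extract():
                yield {"url": response.urljoin(href)}

Run it with "scrapy runspider spider.py -o posts.json" and you get structured output without writing any of the plumbing yourself.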
Good resource - I've been using BeautifulSoup[1] for the scraper I set up for my needs, and it's probably worth checking out as well!<p>[1]: <a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow">http://www.crummy.com/software/BeautifulSoup/</a>
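For comparison, the core of a BeautifulSoup scrape is only a few lines (URL and selector below are placeholders):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/posts").text  # placeholder URL
    soup = BeautifulSoup(html)
    # collect the text of every link under elements with class "title"
    titles = [a.get_text() for a in soup.select(".title a")]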
PyQuery is pretty good for navigating around the DOM too.<p><a href="http://pythonhosted.org//pyquery/" rel="nofollow">http://pythonhosted.org//pyquery/</a>
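A quick sketch, with a placeholder URL and selector:

    from pyquery import PyQuery as pq

    doc = pq(url="http://example.com/posts")  # placeholder URL
    # jQuery-style selection and traversal
    for a in doc("td.title a").items():
        print(a.text(), a.attr("href"))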
Great summary of how to get started on the topic, really nice! I only wish it were longer, as I love playing with scraping (regex lover here), and unfortunately not many people consider going straight to lxml + xpath, which is ridiculously fast. Sometimes I see people writing a bunch of lines to walk a tree of elements with BeautifulSoup selectors, or even spinning up a full Scrapy project, and I'm like, "dude, why didn't you just extract that tiny bit of data with a single xpath?". Suggestion for a future update: try covering the caveats of lxml (invalid pages [technically not lxml's fault, but okay], the limitations of XPath 1.0 compared to 2.0 [which lxml doesn't support], tricky charset detection) and maybe throw in a few code samples in both BS and lxml to compare when to use each of them :-)
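To illustrate the point, here is roughly what a "single xpath" extraction looks like with lxml (page and expression below are placeholders):

    import requests
    import lxml.html

    html = requests.get("http://example.com/posts").content  # placeholder URL
    tree = lxml.html.fromstring(html)
    # one XPath expression instead of a loop of find()/find_all() calls
    titles = tree.xpath('//td[@class="title"]/a/text()')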
For a browser that runs JS, the author mentions PhantomJS, but it looks like its Python support is iffy. Mechanize is super easy in Python:
<a href="http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet" rel="nofollow">http://www.pythonforbeginners.com/cheatsheet/python-mechaniz...</a><p>Edit: so easy, in fact, that I prefer to just START with mechanize to fetch the pages--why bother testing whether or not your downloader needs cookies, a reasonable user agent, etc.--just start with them.
Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that allow you to specify imperative code for organizing the data cleanup before and after parsing. See how easy it is to extract data from the HN front page: <a href="https://github.com/aGHz/structominer/blob/master/examples/hn.py" rel="nofollow">https://github.com/aGHz/structominer/blob/master/examples/hn...</a><p>I'm still working on proper packaging so for the moment the only way to install Struct-o-miner is to clone it from <a href="https://github.com/aGHz/structominer" rel="nofollow">https://github.com/aGHz/structominer</a>.
> Use the Network tab in Chrome Developer Tools to find the AJAX request, you'll usually be greeted by a response in json.<p>But sometimes you won't. Sometimes you'll be assaulted with a response in a proprietary, obfuscated, or encrypted format. In situations where reverse-engineering the Javascript is unrealistic (perhaps it is equally obfuscated), I recommend Selenium[1][2] for scraping. It hooks a remote control to Firefox, Opera, Chrome, or IE, and allows you to read the data back out.<p>[1]: <a href="http://docs.seleniumhq.org/" rel="nofollow">http://docs.seleniumhq.org/</a>
[2]: <a href="http://docs.seleniumhq.org/projects/webdriver/" rel="nofollow">http://docs.seleniumhq.org/projects/webdriver/</a>
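A rough sketch of the Selenium approach (URL and selector below are placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()              # or Chrome(), Ie(), etc.
    try:
        driver.get("http://example.com/app")  # placeholder URL
        # the real browser has already executed the page's JS by this point
        rows = driver.find_elements_by_css_selector("table.data tr")
        data = [row.text for row in rows]
    finally:
        driver.quit()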
Try this scraper framework: <a href="https://github.com/hernan604/HTML-Robot-Scraper" rel="nofollow">https://github.com/hernan604/HTML-Robot-Scraper</a>
Warning: Self-Advertisement!<p><a href="http://www.joyofdata.de/blog/using-linux-shell-web-scraping/" rel="nofollow">http://www.joyofdata.de/blog/using-linux-shell-web-scraping/</a><p>Okay, it's not as bold as using headers saying "Hire Me", but I would like to emphasize that sometimes even complex tasks can be super easy when you use the right tools. And a combination of Linux shell tools makes this task really very straightforward (literally).
> If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.<p>Also, in many cases not all of the functionality/features are available through the API.<p>Edit: By the way, without JS enabled, the code blocks on your website are basically unviewable (at least on Firefox).<p><a href="http://imgur.com/CSYxMfL" rel="nofollow">http://imgur.com/CSYxMfL</a>
This is a great starting point. Can anyone recommend any resources for how to best set up a remote scraping box on AWS or another similar provider? Pitfalls, best tools to help manage/automate scripts etc. I've found a few "getting started" tutorials like this one but I haven't been able to find anything good that discusses scraping beyond running basic scripts on your local machine.
I created a news aggregator based entirely on a Python scraper. I run scraping jobs as periodic Celery tasks. Usually I start from the RSS feeds and parse them with the "feedparser" module, then use "pyquery" to find the Open Graph tags.<p>Pyquery is an excellent module, but I think its parser is not very forgiving, so it might fail on some invalid/non-standard markup.
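Roughly what that pipeline looks like (feed URL and field names below are placeholders, not the actual aggregator code):

    import feedparser
    from pyquery import PyQuery as pq

    feed = feedparser.parse("http://example.com/rss")  # placeholder feed URL
    for entry in feed.entries:
        doc = pq(url=entry.link)                       # fetch the linked article
        # pull Open Graph metadata out of the page head
        og_title = doc('meta[property="og:title"]').attr("content")
        og_image = doc('meta[property="og:image"]').attr("content")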
The part on how to avoid detection was particularly useful for me.<p>I use webscraping (<a href="https://code.google.com/p/webscraping/" rel="nofollow">https://code.google.com/p/webscraping/</a>) + BeautifulSoup. What I like about webscraping is that it automatically creates a local cache of the page you access so you don't end up needlessly hitting the site while you are testing the scraper.
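If I recall the interface correctly, the combination looks roughly like this (the Download class and its get method are from memory, so treat them as assumptions and check the project docs):

    from webscraping import download           # interface assumed from memory
    from bs4 import BeautifulSoup

    D = download.Download()                    # assumed to cache fetched pages locally
    html = D.get("http://example.com/posts")   # placeholder URL; repeat calls should hit the cache
    soup = BeautifulSoup(html)
    titles = [a.get_text() for a in soup.select(".title a")]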
He recommends <a href="http://51proxy.com/" rel="nofollow">http://51proxy.com/</a> - I was curious and signed up for their smallest package. It's been 24 hours since I created the account and I have 48 "Waiting" proxies, and I haven't been able to connect to either of the "Live" ones.<p>Has anyone had any success with them?
For those interested, I wrote a "web scraping 101" tutorial in the past that uses BeautifulSoup: <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" rel="nofollow">http://www.gregreda.com/2013/03/03/web-scraping-101-with-pyt...</a>
For all those mentioning scrapy, would it be a good fit for authenticated scraping with parameters (log in to different banks and get recent transactions)?
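It can be: Scrapy's FormRequest.from_response is the usual way to submit a login form before crawling. A hedged sketch (URL, form fields, and selectors below are placeholders; real bank logins often add CSRF tokens, MFA, and JS that complicate things considerably):

    import scrapy
    from scrapy.http import FormRequest

    class TransactionsSpider(scrapy.Spider):
        name = "transactions"
        start_urls = ["http://example-bank.com/login"]  # placeholder URL

        def parse(self, response):
            # fill and submit the login form found on the page
            return FormRequest.from_response(
                response,
                formdata={"username": "me", "password": "secret"},  # placeholders
                callback=self.after_login,
            )

        def after_login(self, response):
            # .extract() returns a list of matching strings
            for row in response.xpath('//table[@id="transactions"]//tr'):
                yield {
                    "description": row.xpath("td[1]/text()").extract(),
                    "amount": row.xpath("td[2]/text()").extract(),
                }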