Why not just use Scrapy[1]? It's built for this sort of thing, easily extensible, and written in Python.<p>[1] <a href="http://scrapy.org/" rel="nofollow">http://scrapy.org/</a>
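For anyone who hasn't tried it, a minimal Scrapy spider sketch looks roughly like this (the URL, spider name, and selectors below are placeholders, not anything from the article):

    import scrapy

    class PostSpider(scrapy.Spider):
        name = "posts"
        start_urls = ["http://example.com/posts"]  # placeholder URL

        def parse(self, response):
            # one XPath per field; Scrapy handles scheduling, retries and throttling
            for href in response.xpath('//a[@class="post"]/@href').extract():
                yield {"url": response.urljoin(href)}

Run it with "scrapy runspider spider.py -o posts.json" and you get structured output without writing any of the plumbing yourself.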
Good resource - I've been using BeautifulSoup[1] for the scraper I set up for my needs, and it's probably worth checking out as well!<p>[1]: <a href="http://www.crummy.com/software/BeautifulSoup/" rel="nofollow">http://www.crummy.com/software/BeautifulSoup/</a>
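For comparison, the core of a BeautifulSoup scrape is only a few lines (URL and selector below are placeholders):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/posts").text  # placeholder URL
    soup = BeautifulSoup(html)
    # collect the text of every link under elements with class "title"
    titles = [a.get_text() for a in soup.select(".title a")]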
PyQuery is pretty good for navigating around the DOM too.<p><a href="http://pythonhosted.org//pyquery/" rel="nofollow">http://pythonhosted.org//pyquery/</a>
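A quick sketch, with a placeholder URL and selector:

    from pyquery import PyQuery as pq

    doc = pq(url="http://example.com/posts")  # placeholder URL
    # jQuery-style selection and traversal
    for a in doc("td.title a").items():
        print(a.text(), a.attr("href"))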
Great summary of how to get started on the topic, really nice! I only wish it were longer, as I love playing with scraping (regex lover here), and unfortunately not many people consider going straight to lxml + xpath, which is ridiculously fast. Sometimes I see people writing a bunch of lines to walk a tree of elements with BeautifulSoup selectors, or even spinning up a full Scrapy project, and I'm like, "dude, why didn't you just extract that tiny bit of data with a single xpath?". Suggestion for a future update: try covering the caveats of lxml (invalid pages [technically not lxml's fault, but okay], the limitations of XPath 1.0 compared to 2.0 [which lxml doesn't support], tricky charset detection) and maybe throw in a few code samples in both BS and lxml to compare when to use each of them :-)
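To illustrate the point, here is roughly what a "single xpath" extraction looks like with lxml (page and expression below are placeholders):

    import requests
    import lxml.html

    html = requests.get("http://example.com/posts").content  # placeholder URL
    tree = lxml.html.fromstring(html)
    # one XPath expression instead of a loop of find()/find_all() calls
    titles = tree.xpath('//td[@class="title"]/a/text()')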
For a browser that runs JS, the author mentions PhantomJS, but it looks like its Python support is iffy. Mechanize is super easy in Python:
<a href="http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet" rel="nofollow">http://www.pythonforbeginners.com/cheatsheet/python-mechaniz...</a><p>Edit: so easy, in fact, that I prefer to just START with mechanize to fetch the pages--why bother testing whether or not your downloader needs cookies, a reasonable user agent, etc.--just start with them.
Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that allow you to specify imperative code for organizing the data cleanup before and after parsing. See how easy it is to extract data from the HN front page: <a href="https://github.com/aGHz/structominer/blob/master/examples/hn.py" rel="nofollow">https://github.com/aGHz/structominer/blob/master/examples/hn...</a><p>I'm still working on proper packaging so for the moment the only way to install Struct-o-miner is to clone it from <a href="https://github.com/aGHz/structominer" rel="nofollow">https://github.com/aGHz/structominer</a>.
> Use the Network tab in Chrome Developer Tools to find the AJAX request, you'll usually be greeted by a response in json.<p>But sometimes you won't. Sometimes you'll be assaulted with a response in a proprietary, obfuscated, or encrypted format. In situations where reverse-engineering the Javascript is unrealistic (perhaps it is equally obfuscated), I recommend Selenium[1][2] for scraping. It hooks a remote control to Firefox, Opera, Chrome, or IE, and allows you to read the data back out.<p>[1]: <a href="http://docs.seleniumhq.org/" rel="nofollow">http://docs.seleniumhq.org/</a>
[2]: <a href="http://docs.seleniumhq.org/projects/webdriver/" rel="nofollow">http://docs.seleniumhq.org/projects/webdriver/</a>
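A rough sketch of the Selenium approach (URL and selector below are placeholders):

    from selenium import webdriver

    driver = webdriver.Firefox()              # or Chrome(), Ie(), etc.
    try:
        driver.get("http://example.com/app")  # placeholder URL
        # the real browser has already executed the page's JS by this point
        rows = driver.find_elements_by_css_selector("table.data tr")
        data = [row.text for row in rows]
    finally:
        driver.quit()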
Try this scraper framework: <a href="https://github.com/hernan604/HTML-Robot-Scraper" rel="nofollow">https://github.com/hernan604/HTML-Robot-Scraper</a>
Warning: Self-Advertisement!<p><a href="http://www.joyofdata.de/blog/using-linux-shell-web-scraping/" rel="nofollow">http://www.joyofdata.de/blog/using-linux-shell-web-scraping/</a><p>Okay, it's not as bold as using headers saying "Hire Me", but I would like to emphasize that sometimes even complex tasks can be super easy when you use the right tools. And a combination of Linux shell tools makes this task really very straightforward (literally).
> If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.<p>Also, in many cases not all of the functionality/features are available through the API.<p>Edit: By the way, without JS enabled, the code blocks on your website are basically unviewable (at least on Firefox).<p><a href="http://imgur.com/CSYxMfL" rel="nofollow">http://imgur.com/CSYxMfL</a>
This is a great starting point. Can anyone recommend any resources for how to best set up a remote scraping box on AWS or another similar provider? Pitfalls, best tools to help manage/automate scripts etc. I've found a few "getting started" tutorials like this one but I haven't been able to find anything good that discusses scraping beyond running basic scripts on your local machine.
I created a news aggregator based entirely on a Python scraper. I run scraping jobs as periodic Celery tasks. Usually I start from the RSS feeds and parse them with the "feedparser" module, then use "pyquery" to find the Open Graph tags.<p>Pyquery is an excellent module, but I think its parser is not very forgiving, so it might fail on some invalid/non-standard markup.
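Roughly what that pipeline looks like (feed URL and field names below are placeholders, not the actual aggregator code):

    import feedparser
    from pyquery import PyQuery as pq

    feed = feedparser.parse("http://example.com/rss")  # placeholder feed URL
    for entry in feed.entries:
        doc = pq(url=entry.link)                       # fetch the linked article
        # pull Open Graph metadata out of the page head
        og_title = doc('meta[property="og:title"]').attr("content")
        og_image = doc('meta[property="og:image"]').attr("content")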
The part on how to avoid detection was particularly useful for me.<p>I use webscraping (<a href="https://code.google.com/p/webscraping/" rel="nofollow">https://code.google.com/p/webscraping/</a>) + BeautifulSoup. What I like about webscraping is that it automatically creates a local cache of the page you access so you don't end up needlessly hitting the site while you are testing the scraper.
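If I recall the interface correctly, the combination looks roughly like this (the Download class and its get method are from memory, so treat them as assumptions and check the project docs):

    from webscraping import download           # interface assumed from memory
    from bs4 import BeautifulSoup

    D = download.Download()                    # assumed to cache fetched pages locally
    html = D.get("http://example.com/posts")   # placeholder URL; repeat calls should hit the cache
    soup = BeautifulSoup(html)
    titles = [a.get_text() for a in soup.select(".title a")]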
He recommends <a href="http://51proxy.com/" rel="nofollow">http://51proxy.com/</a> - I was curious and signed up for their smallest package. It's been 24 hours since I created the account and I have 48 "Waiting" proxies, and I haven't been able to connect to either of the "Live" ones.<p>Has anyone had any success with them?
For those interested, I wrote a "web scraping 101" tutorial in the past that uses BeautifulSoup: <a href="http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/" rel="nofollow">http://www.gregreda.com/2013/03/03/web-scraping-101-with-pyt...</a>
For all those mentioning scrapy, would it be a good fit for authenticated scraping with parameters (log in to different banks and get recent transactions)?
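It can be: Scrapy's FormRequest.from_response is the usual way to submit a login form before crawling. A hedged sketch (URL, form fields, and selectors below are placeholders; real bank logins often add CSRF tokens, MFA, and JS that complicate things considerably):

    import scrapy
    from scrapy.http import FormRequest

    class TransactionsSpider(scrapy.Spider):
        name = "transactions"
        start_urls = ["http://example-bank.com/login"]  # placeholder URL

        def parse(self, response):
            # fill and submit the login form found on the page
            return FormRequest.from_response(
                response,
                formdata={"username": "me", "password": "secret"},  # placeholders
                callback=self.after_login,
            )

        def after_login(self, response):
            # .extract() returns a list of matching strings
            for row in response.xpath('//table[@id="transactions"]//tr'):
                yield {
                    "description": row.xpath("td[1]/text()").extract(),
                    "amount": row.xpath("td[2]/text()").extract(),
                }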