I've done a ton of scraping (mostly legal: on behalf of end users of an app, on sites they have legitimate access to). This article misses something that affects several sites: JavaScript-driven content. Faking headers and even setting cookies doesn't get around this. It is, of course, easy to get around using something like PhantomJS or Selenium. Selenium is great because, unlike all the whiz-bang scraping techniques, you're driving a real browser and your requests look real (if you make 10,000 requests to index.php and never pull down a single image, you might look a bit suspicious). There's a bit more overhead, but micro instances on EC2 can easily run 2 or 3 Selenium sessions at the same time, and at 0.3 cents per hour for spot instances, you can have 200-300 browsers going for 30-50 cents/hour.
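A minimal Python sketch of that Selenium approach (the URL and selector are placeholders, not from the comment above):<p><pre><code>
# Sketch: drive a real Firefox via Selenium so JS-rendered content is in the
# DOM before we read it. Target URL and CSS selector are invented.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/listings")   # hypothetical JS-heavy page
    driver.implicitly_wait(10)                   # give client-side rendering time to finish
    for row in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
        print(row.text)
finally:
    driver.quit()
</code></pre>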
(shameless plug) I can scrape asynchronously, anonymously, with JS wizardry, and feed it into your defined models in your MVC (e.g. Django). But! I need to get to a hacker conference on the other side of the world (29c3). Any other time of year, I'd just drop a tutorial. See profile if you'd like to help me with a consulting gig.<p>EDIT: Knowledge isn't zero-sum. Here's an overview of a kick-ass way to spider/scrape:<p>I use Scrapy to spider asynchronously. When I define the crawler bot as an object, and the site contains complicated stuff (stateful forms or JavaScript), I usually create methods that pull in either Mechanize or QtWebKit. XPath selectors are also useful because you don't have to specify the entire XML tree from trunk to leaf. I then import pre-existing Django models from the site I want the data to go into and write to the DB. At this point you usually have to convert some types.<p>I find Scrapy cleaner and more like a pipeline, so it seems to produce less 'side-effect kludge' than other scraping methods (if anybody has seen a complex Beautiful Soup + Mechanize scraper, you know what I mean by 'side-effect kludge'). It can also act as a server to return JSON.<p>Being asynchronous, you can do crazy req/s.<p>I will leave out how to do all this through Tor because I don't want the Tor network being abused, but I'm happy to talk about it one on one if your interest is beyond spamming the web.<p>Through this + a couple of unmentioned tricks, it's possible to get <i>insane</i> data, so much so that it crosses over into security research & could be used for pen-testing.
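As a rough illustration of that pipeline, a minimal sketch (all project, model, and field names are assumed, and it uses current Scrapy/Django APIs rather than the commenter's exact setup):<p><pre><code>
# Sketch only: an XPath-driven Scrapy spider plus an item pipeline that does the
# type conversion and writes into a pre-existing Django model. All names assumed.
import os
from decimal import Decimal

import django
import scrapy

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")  # hypothetical project
django.setup()

from catalog.models import Product  # hypothetical pre-existing Django model


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["http://example.com/catalog"]  # hypothetical target

    def parse(self, response):
        # '//' and './/' mean we never spell out the whole tree from trunk to leaf.
        for row in response.xpath('//div[@class="product"]'):
            yield {
                "name": row.xpath('.//h2/text()').get(),
                "price": row.xpath('.//span[@class="price"]/text()').get(),
            }


class DjangoWriterPipeline:
    """Convert scraped strings to the model's types and write to the DB.
    Enabled by listing it in ITEM_PIPELINES in the Scrapy settings."""

    def process_item(self, item, spider):
        Product.objects.create(
            name=item["name"].strip(),
            price=Decimal(item["price"].lstrip("$")),  # the type-conversion step
        )
        return item
</code></pre>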
And this is why we can't have nice things.<p>Web scraping, as fun as it is (and btw, this title <i>again</i> abuses "Fun and Profit"), is not a practice we should encourage. Yes, it's the sort of dirty practice many people do, at one point or another, but it shouldn't be glorified.
There are some recent federal cases (Weev <a href="http://www.wired.com/opinion/2012/11/att-ipad-hacker-when-embarassment-becomes-a-crime/" rel="nofollow">http://www.wired.com/opinion/2012/11/att-ipad-hacker-when-em...</a>, Aaron Swartz <a href="http://www.wired.com/threatlevel/2012/09/aaron-swartz-felony/" rel="nofollow">http://www.wired.com/threatlevel/2012/09/aaron-swartz-felony...</a>, and a prosecution of scalpers <a href="http://www.wired.com/threatlevel/2010/07/ticketmaster/" rel="nofollow">http://www.wired.com/threatlevel/2010/07/ticketmaster/</a>) that view scraping as a felony hacking offense. The feds think that an attempt to evade CAPTCHAs and IP and MAC blocks is a felony worthy of years in prison.<p>In fact, the feds might think that clearing your cookies or switching browsers to get another 10 free articles from the NYTimes is also felony hacking.<p>Which is to say, be careful what you admit to in this forum AND how you characterize what you are doing in your private conversations and e-mails.<p>Weev now faces a decade or more in prison because he drummed up publicity by sending emails to journalists that used the verb "stole".
From the article:<p><pre><code> Since the third party service conducted rate-limiting based on IP
address (stated in their docs), my solution was to put the code that
hit their service into some client-side Javascript, and then send
the results back to my server from each of the clients.
This way, the requests would appear to come from thousands of
different places, since each client would presumably have their own
unique IP address, and none of them would individually be going over
the rate limit.
</code></pre>
Pretty sure the browser's same-origin policy forbids this. Think about it: if this worked, you'd be able to scrape inside corporate firewalls simply by having users visit your website from behind the firewall.
The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.<p>If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.
Great read!<p>In the past, I have successfully used HtmlUnit to fulfill my admittedly limited scraping needs.<p>It runs headless, but it has a virtual head designed to pretend it's a user visiting a web application to be tested for QA purposes. You just program it to go through the motions of a human visiting a site to be tested (or scraped). E.g., click here, get some response. For each whatever in the response, click and aggregate the results in your output (to whatever granularity).<p>Alas, it's in Java. But, if you use JRuby, you can avoid most of the nastiness that implies. (You do need to <i>know</i> Java, but at least you don't have to <i>write</i> Java.)<p>Hartley, what is your recommended toolkit?<p>I note you mentioned the problem of dynamically generated content. You develop your plan of attack using the browser plus Chrome Inspector or Firebug. So far, so good. But what if you want to be headless? Then you need something that will generate a DOM as if presenting a real user interface, but instead simply returns a reference to the DOM tree that you are free to scan and react to.
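One way to get that headless DOM today, sketched in Python with Selenium's headless Chrome plus Beautiful Soup (URL and selector are placeholders; the same idea as HtmlUnit, not a drop-in replacement):<p><pre><code>
# Sketch: render headlessly, then hand the finished DOM to an HTML parser.
# Assumes chromedriver and BeautifulSoup are installed; URL is a placeholder.
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")       # no visible window, full JS execution
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/app")
    dom = driver.page_source             # the DOM after scripts have run
finally:
    driver.quit()

soup = BeautifulSoup(dom, "html.parser")
for cell in soup.select("td.result"):    # scan/react to the tree offline
    print(cell.get_text(strip=True))
</code></pre>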
I love HTML scraping.
But JavaScript???... The juiciest data sets these days are increasingly in JS.
For the life of me, I can't get around scraping JS :(<p>I do know that Selenium can be used for this... but I have yet to see a decent example. Does anyone have any good resources/examples on JS scraping that they could share??
I would be eternally grateful.
Another issue not covered: file downloads. Let's say you have a process that creates a dynamic image, or logs in and downloads dynamic PDFs. Even Selenium can't handle this out of the box (the download dialog is an OS-level feature). At one point I was able to get Chrome to auto-download in Selenium, but I had zero control over the filename and where it was saved. I ended up using iMacros (the paid version) to drive this (using Windows instances: their Linux version is comparatively immature).
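For what it's worth, recent Chrome/chromedriver builds do let Selenium steer downloads through browser preferences, which avoids the OS-level dialog; a sketch (paths and URL are placeholders):<p><pre><code>
# Sketch: steer Chrome's downloads from Selenium so no OS-level dialog appears.
# The preference keys are Chrome's own; the directory and URL are placeholders.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/scrape-downloads",  # where files land
    "download.prompt_for_download": False,                  # skip the save dialog
})
driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.com/report.pdf")  # hypothetical dynamic PDF
finally:
    driver.quit()
</code></pre>
The server still picks the filename, so renaming the file after the download lands is usually a separate step.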
Scraping could be made a lot harder by website publishers, but they all depend on the biggest scraper of all accessing their content so it can bring them traffic: Google ...<p>The biggest downside of scraping is that it often takes a long time to collect very little content (e.g. scraping online stores with extremely bloated HTML and 10-25 products per page).
An important topic.<p>The main caveat is that this may violate a site's terms of use and thus website owners may feel called upon to sue you. Depending on circumstances, the legal situation here can be a long story.
Related: If you fancy writing scrapers for fun <i>and</i> profit, ScraperWiki (a Liverpool, UK-based data startup) is currently hiring full-time data scientists. Check us out!<p><a href="http://scraperwiki.com/jobs/#swjob5" rel="nofollow">http://scraperwiki.com/jobs/#swjob5</a>
The title makes it sound as if there is going to be some discussion of how the OP has made web scraping profitable, but this seems to have been left to the reader's imagination.<p>Otherwise, great article! I agree that BeautifulSoup is a great tool for this.
It's pointless to think of it as "wrong" for third parties to web-scrape. Entities will do as they must to survive. The onus of mitigating web scraping, if that is in the interests of the publisher, is on the publisher.<p>As a startup developer, third-party scraping is something I need to be aware of, and something I need to defend against if doing so suits my interests. A little bit of research shows that this is not impractical. Dynamic IP restrictions (or slowbanning), rudimentary data watermarking, and caching of anonymous request output all mitigate this. Spot-checking popular content by running it through Google Search requires all of five minutes per week. At that point, the specific situation can be addressed holistically (a simple attribution license might make everyone happy). With enough research, one might consider hellbanning the offender (serving bogus content to requests matching a certain heuristic) as a deterrent. A legal pursuit, with its cost, would likely be a last resort.<p>Accept the possibility of being scraped and prepare accordingly.
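To make the "dynamic IP restrictions (or slowbanning)" point concrete, a naive per-IP throttle sketch in Python (window and threshold values are arbitrary; a real deployment would use shared storage such as Redis):<p><pre><code>
# Sketch: naive in-memory per-IP throttle. Thresholds are arbitrary.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120           # ~2 req/s sustained before we slow/ban

_hits = defaultdict(deque)   # ip -> timestamps of recent requests

def allow(ip):
    now = time.time()
    q = _hits[ip]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()          # drop hits outside the window
    q.append(now)
    return len(q) <= MAX_REQUESTS   # False -> slowban or serve degraded content
</code></pre>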
People seem to wonder how to handle AJAX.<p>The answer is HttpFox. It records all HTTP requests.<p>1. Start recording<p>2. Do some action that causes data to be fetched<p>3. Stop recording<p>You will find the URL, the returned data, and a nice table of GET and POST variables.<p><a href="https://addons.mozilla.org/en-us/firefox/addon/httpfox/" rel="nofollow">https://addons.mozilla.org/en-us/firefox/addon/httpfox/</a>
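Once HttpFox has shown you the URL and its GET/POST variables, you can usually just replay that request directly; a sketch using the Python requests library (endpoint and parameter names are invented):<p><pre><code>
# Sketch: replay an AJAX endpoint found with HttpFox / a web inspector.
# The endpoint and parameter names are invented for illustration.
import requests

resp = requests.get(
    "http://example.com/ajax/search",
    params={"q": "widgets", "page": 1},
    headers={"X-Requested-With": "XMLHttpRequest"},  # some endpoints check this
)
data = resp.json()   # these endpoints usually return clean JSON
for item in data.get("results", []):
    print(item["title"])
</code></pre>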
From a site owner's perspective: if you have a LOT of data then scraping can be very disruptive. I've had someone scraping my site for literally months, using hundreds of different open proxies, plus multiple faked user-agents, in order to defeat scraping detection. At one point they were accessing my site over 300,000 times per day (3.5/sec), which exceeded the level of the next busiest (and welcome) agent... Googlebot. In total I estimate this person has made more than 30 million fetch attempts over the past few months. I eventually figured out a unique signature for their bot and blocked 95%+ of their attempts, but they still kept trying. I managed to find a contact for their network administrator and the constant door-knocking finally stopped today.
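The kind of log analysis that surfaces such a signature might look like this sketch (the log path, format, and the fields it keys on are assumptions about a typical combined-log setup):<p><pre><code>
# Sketch: count requests per (user-agent, URL-pattern) pair in a combined-format
# access log, to surface a bot that rotates IPs but keeps the same fingerprint.
import re
from collections import Counter, defaultdict

# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

hits = Counter()            # (user-agent, path) -> request count
ips = defaultdict(set)      # (user-agent, path) -> distinct client IPs

with open("/var/log/nginx/access.log") as log:   # assumed log location
    for line in log:
        m = LINE.match(line)
        if not m:
            continue
        ip, _method, path, ua = m.groups()
        key = (ua, path.split("?")[0])
        hits[key] += 1
        ips[key].add(ip)

# A bot behind many open proxies shows up as a huge request count spread over
# an unusually large number of IPs for the same UA/URL pattern.
for key, n in hits.most_common(10):
    print(n, len(ips[key]), key[0][:60], key[1])
</code></pre>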
When I need to scrape a webpage, I use phpQuery (<a href="http://code.google.com/p/phpquery/" rel="nofollow">http://code.google.com/p/phpquery/</a>). It's dead simple if you have experience with jQuery, and I get all the benefits of a server-side programming language.
What I wish I could do is capture Flash audio (or any audio) streams with my Mac. All I want is to listen to the audio-only content with an audio player when I'm out driving or jogging, etc. Audio-only content that has to be played off a web page usually runs into the contradiction that if I'm in a position to click buttons on my web browser (not driving, for example), I'm in a position to do my REAL work and have no time to listen to the audio. I'll go to the web page, see whatever ads they may have, but then I'd like to be able to "scrape" the audio stream into a file so I don't have to sit there staring at a static web page the whole time I'm listening.
When scraping HTML where data gets populated with JS/AJAX, you can use a web inspector to see where that data is coming from and GET it manually; it will likely be some nice JSON.<p>Scraping used to be the way to get data back in the day, but websites also didn't change their layout/structure on a weekly basis back then and were much more static when it came to structure.<p>Having recently written a small app that was forced to scrape HTML, and having had to update it every month to keep it working, I can't imagine doing this for a larger project and maintaining it.
To all of HN: all this being said, how do we prevent our sites from being scraped in this way? What can you not get around, and, to your mind, what are the potential uses for an 'unscrapeable' site?
I think the author just completely missed the point of API vs. screen scraping.
An API allows for accessing structured data. Even if the website changes, the data would still be accessible the same way through the API.
Whereas the author would have to rewrite his code each time an update is made to the front-end code of the website.<p>A simple API providing a plain JSON response with HTTP basic auth is far more efficient than a screen-scraping program where you have to parse the response using HTML/XML parsers.
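To illustrate the comparison, consuming that kind of simple JSON-over-basic-auth API is only a few lines (URL and credentials are placeholders):<p><pre><code>
# Sketch: consuming a simple JSON API protected by HTTP basic auth.
# URL and credentials are placeholders.
import requests

resp = requests.get(
    "http://example.com/api/v1/products",
    auth=("api_user", "api_key"),        # HTTP basic auth
)
resp.raise_for_status()
for product in resp.json():              # structured data, no HTML parsing needed
    print(product["name"], product["price"])
</code></pre>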
This illustrates the significant difference between the use-cases of "web APIs" and conventional APIs, that the former are more like a database CRUD (including REST), rather than a request for computation. They (usually) are an alternative interface to a website (a GUI), and that's how most websites are used. e.g. an API for HN would allow story/comment retrieval, voting, submission, commenting.<p>They <i>could</i> be used for computation, but (mostly) aren't.
Not every site. There is data I would really love to access on Facebook without having to gain specific authorization from the user. It's odd that for most user profiles the most you can extract via the graph API (with no access token) is their name and sex. Whereas I can visit their profile page in the browser, see all sorts of info and latest updates (and not even be friends with them)<p>Tried scraping Facebook. They have IP blocks and the like.
This is a shameless plug but I've created a service that aims to help with a lot of the issues that OP describes such as rate limiting, JS and scaling. It's a bit like Heroku for web scraping and automation. It's still in beta but if anyone is interested then check out <a href="http://tubes.io" rel="nofollow">http://tubes.io</a>.
I have done a bit of scraping with Ruby Mechanize; when we hit limits, we circumvented them with proxies and Tor.<p>Google, as a search engine, crawls almost all sites, but offers very little usable stuff to other bots:<p><a href="http://www.google.com/robots.txt" rel="nofollow">http://www.google.com/robots.txt</a><p>Disallow rules: 247
Allow rules: 41
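Those counts are easy to reproduce with a short script (the totals will drift as Google edits the file):<p><pre><code>
# Sketch: count Disallow/Allow rules in a robots.txt. Totals change over time.
import urllib.request
from collections import Counter

with urllib.request.urlopen("http://www.google.com/robots.txt") as resp:
    body = resp.read().decode("utf-8", errors="replace")

counts = Counter(
    line.split(":", 1)[0].strip().lower()
    for line in body.splitlines()
    if ":" in line and not line.lstrip().startswith("#")
)
print("Disallow:", counts["disallow"])
print("Allow:", counts["allow"])
</code></pre>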
Be careful. I got banned from Google for scraping. I did a few hundred thousand searches one day, and that night, they banned my office IP address for a week. This was in 2001, so I estimate I cost them a few hundred dollars, which is now impossible to repay. :(
The problem with scraping instead of using the API is that when the website makes even a slight change to its markup, it breaks your code. I have had that experience and it's a living hell. I can say it's not worth it to scrape when there is an API available.
There is just one major problem with not needing a stinking API: you cannot POST on behalf of a prospective client without requiring them to give you their password, which would give you full access to their account instead of the limited access an API grants.
I had to do some scraping of a rather JavaScript-heavy site last year; I found the entire process was made almost trivial using Ruby and Nokogiri. Particularly relevant for a non-uber-programmer like me, it's simple to use as well as powerful.
So bloody true. A web page is a resource just like an XML doc; there's no reason public-facing URLs and web content can't be treated as such, and I regularly take advantage of that fact as well. Great post.
If it doesn't need to be automated and only runs a few times, I prefer iMacros to perform tasks on my behalf. The best part is that you can integrate a DB to record your desired data.
Automated web testing tools, such as Watir and Selenium, are also pretty good options. I'm especially surprised Watir hasn't been mentioned yet in the comments.