The number 1 post right now is about how to use scraping to essentially get a handle on undocumented APIs. That's all well and good, but here are my questions for HN: how do we prevent our sites from being scraped in this way? What can't you get around, and what are the potential uses for an 'unscrapeable' site, in your opinion? Is the push to obfuscate with JavaScript a side effect of modern web app architecture, or is it an intentional design choice in sites that exhibit such behavior?
As someone who has done a lot of scraping in the past: you just need to change the CSS classnames or redesign your pages once in a while :) This breaks a lot of automated bots that extract semantic meaning from a webpage using HTML + regexp parsers.

I would suggest staying away from relying on JS, as it affects genuine users as well (e.g. those who use screen readers).
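To illustrate why classname churn works, here is a minimal sketch of the kind of brittle scraper this breaks (the URL and the "price-tag" class are made up for the example):

    # Hypothetical scraper tied to exact markup; renaming the class kills it.
    import re
    import urllib.request

    html = urllib.request.urlopen("https://example.com/products").read().decode()

    # Brittle: matches only <span class="price-tag">...</span>.
    # Change "price-tag" to anything else and this returns zero results.
    prices = re.findall(r'<span class="price-tag">([^<]+)</span>', html)
    print(prices)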
It's very hard, if not impossible, to prevent screen scraping. In the worst-case scenario, the person scraping uses a Firefox instance running on a real display, driven by a tool like Sikuli that moves the mouse the same way a human would.

No, I take that back. The worst-case scenario is hiring a team of people in some low-wage country to manually go through the site and extract the information.

How do you prevent those cases? I think the most you can do is throttle based on a mixture of login account and request IP address.

That said, the first step is to develop a threat model. You need to get an idea of why people would want to scrape your site, what their incentive is, and what effect it would have on your site and business if your data were scraped.
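A minimal sketch of the account + IP throttling described above (the window size and request limit are arbitrary assumptions, and a real deployment would use a shared store rather than process memory):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # assumed window
    MAX_REQUESTS = 100    # assumed per-window limit

    _hits = defaultdict(deque)

    def allow_request(account_id, ip_address):
        """Return True if this (account, IP) pair is under the rate limit."""
        key = (account_id, ip_address)
        now = time.time()
        window = _hits[key]
        # Drop timestamps that have fallen outside the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_REQUESTS:
            return False
        window.append(now)
        return True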
If some of the data you care about is user-generated, you might want to try the Github model: people can get a free account, but all the information they generate on the site will be public. Keeping your information private requires a paid subscription.
Use Flash or render your content as images. Neither of these is 100% locked down, but it's going to give anyone writing a scraper a run for their money.

Outside of preventing scraping, both of these ideas are likely to be seen as stupid.
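For what the render-as-image idea looks like in practice, here is a rough sketch using the Pillow library (assumed installed; the text, size, and output path are placeholders):

    # Sketch: serve a value as a PNG instead of HTML text.
    from PIL import Image, ImageDraw

    def text_to_png(text, path):
        img = Image.new("RGB", (400, 50), color="white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 15), text, fill="black")  # default bitmap font
        img.save(path)

    text_to_png("Price: $19.99", "price.png")

A scraper now needs OCR instead of a regexp, at the cost of accessibility, SEO, and copy-paste for everyone else.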
There is nothing you can do to prevent scraping, especially with tools like PhantomJS, which use exactly the same engine as your browser.

The ONLY way is, as suggested, to throttle based on IP address and X-Forwarded-For.
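A small sketch of deriving the throttling key when requests arrive through a proxy (framework-agnostic; it assumes you only trust the header because your own proxy appends it):

    def client_ip(remote_addr, headers):
        """Prefer the first X-Forwarded-For entry when behind a trusted proxy."""
        forwarded = headers.get("X-Forwarded-For", "")
        if forwarded:
            # Comma-separated chain; the left-most entry is the original
            # client, but only trustworthy if your proxy sets/appends it.
            return forwarded.split(",")[0].strip()
        return remote_addr

    # Example:
    print(client_ip("10.0.0.1", {"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}))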