The number 1 post right now is about how to use scraping to essentially get a handle on undocumented APIs. That's all well and good, but here are my questions for HN: how do we prevent our sites from being scraped in this way? What can't you get around, and what are the potential uses for an 'unscrapeable' site, in your opinion? Is the push to obfuscate with JavaScript a side effect of modern web app architecture, or is it an intentional design choice in sites that exhibit such behavior?
As someone who has done a lot of scraping in the past: you just need to change the CSS classnames or redesign your pages once in a while :) This breaks a lot of automated bots that extract semantic meaning from a webpage using HTML + regexp parsers.

I would suggest staying away from relying on JS, as it affects genuine users as well (e.g. those who use screen readers).
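To illustrate why classname churn works, here is a minimal sketch of the kind of brittle scraper this breaks (the URL and the "price-tag" class are made up for the example):

    # Hypothetical scraper tied to exact markup; renaming the class kills it.
    import re
    import urllib.request

    html = urllib.request.urlopen("https://example.com/products").read().decode()

    # Brittle: matches only <span class="price-tag">...</span>.
    # Change "price-tag" to anything else and this returns zero results.
    prices = re.findall(r'<span class="price-tag">([^<]+)</span>', html)
    print(prices)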
It's very hard, if not impossible, to prevent screen scraping. In the worst-case scenario, the person scraping uses a Firefox instance running on a real display, driven by a tool like Sikuli that moves the mouse the same way a human would.

No, I take that back. The worst-case scenario is hiring a team of people in some low-wage country to manually go through the site and extract the information.

How do you prevent those cases? I think the most you can do is throttle based on a mixture of login account and request IP address.

That said, the first step is to develop a threat model. You need to get an idea of why people would want to scrape your site, what their incentive is, and what effect it would have on your site and business if your data were scraped.
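A minimal sketch of the account + IP throttling described above (the window size and request limit are arbitrary assumptions, and a real deployment would use a shared store rather than process memory):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # assumed window
    MAX_REQUESTS = 100    # assumed per-window limit

    _hits = defaultdict(deque)

    def allow_request(account_id, ip_address):
        """Return True if this (account, IP) pair is under the rate limit."""
        key = (account_id, ip_address)
        now = time.time()
        window = _hits[key]
        # Drop timestamps that have fallen outside the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_REQUESTS:
            return False
        window.append(now)
        return True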
If some of the data you care about is user-generated, you might want to try the Github model: people can get a free account, but all the information they generate on the site will be public. Keeping your information private requires a paid subscription.
Use Flash or render your content as images. Neither of these is 100% locked down, but it's going to give anyone writing a scraper a run for their money.

Outside of preventing scraping, both of these ideas are likely to be seen as stupid.
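For what the render-as-image idea looks like in practice, here is a rough sketch using the Pillow library (assumed installed; the text, size, and output path are placeholders):

    # Sketch: serve a value as a PNG instead of HTML text.
    from PIL import Image, ImageDraw

    def text_to_png(text, path):
        img = Image.new("RGB", (400, 50), color="white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 15), text, fill="black")  # default bitmap font
        img.save(path)

    text_to_png("Price: $19.99", "price.png")

A scraper now needs OCR instead of a regexp, at the cost of accessibility, SEO, and copy-paste for everyone else.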
There is nothing you can do to prevent scraping, especially with tools like PhantomJS, which use exactly the same engine as your browser.

The ONLY way is, as suggested, to throttle based on IP address and X-Forwarded-For.
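A small sketch of deriving the throttling key when requests arrive through a proxy (framework-agnostic; it assumes you only trust the header because your own proxy appends it):

    def client_ip(remote_addr, headers):
        """Prefer the first X-Forwarded-For entry when behind a trusted proxy."""
        forwarded = headers.get("X-Forwarded-For", "")
        if forwarded:
            # Comma-separated chain; the left-most entry is the original
            # client, but only trustworthy if your proxy sets/appends it.
            return forwarded.split(",")[0].strip()
        return remote_addr

    # Example:
    print(client_ip("10.0.0.1", {"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}))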