Ask HN: How to prevent unwanted scraping

11 points by alhenaadams over 12 years ago
The number 1 post right now is about how to use scraping to essentially get a handle on undocumented APIs. That's all well and good, but here are my questions for HN: how do we prevent our sites from being scraped in this way? What can scrapers not get around, and what are the potential uses for an 'unscrapeable' site, in your opinion? Is the push to obfuscate with JavaScript a side effect of modern web app architecture, or is it the intent of designs that exhibit such behavior?

6 comments

karterk, over 12 years ago
As someone who has done a lot of scraping in the past: you just need to change the CSS class names or redesign your pages once in a while :) This breaks a lot of automated bots that extract semantic meaning from a web page using HTML + regexp parsers.

I would suggest staying away from using JS for this, as it affects genuine users as well (e.g. those who use screen readers).
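
One way to make the class-name rotation above cheap to repeat is to derive the names from a per-deploy salt. This is only a minimal sketch of that idea: `rotated_class`, `render_price`, and the salt values are hypothetical names chosen for illustration, not something from the comment.

```python
import hashlib


def rotated_class(name, deploy_salt):
    """Map a stable, human-readable class name to a per-deploy alias."""
    digest = hashlib.sha256(f"{deploy_salt}:{name}".encode("utf-8")).hexdigest()
    # Prefix with a letter so the result is always a valid CSS identifier.
    return "c" + digest[:8]


def render_price(amount, deploy_salt):
    """Render one fragment using the rotated alias instead of 'price'."""
    cls = rotated_class("price", deploy_salt)
    return f'<span class="{cls}">{amount}</span>'
```

The stylesheet would have to be generated with the same mapping and the same salt, and changing the salt on each deploy changes every class name at once. As the comment implies, this only breaks scrapers that key on class names; it does not stop anyone who adapts.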
dalke, over 12 years ago
Preventing screen scraping is somewhere between very hard and impossible. In the worst case scenario, the person scraping uses a Firefox instance running on a real display, driven by a system like Sikuli that controls the mouse the same way a human would.

No, I take that back. The worst case scenario is hiring a team of people in some low-wage country to manually go through the site to extract the information.

How do you prevent those cases? I think the most you can do is throttle based on a mixture of login account and request IP address.

That said, the first step is to develop a threat model. You need to get an idea of why people would want to scrape your site, the incentive for them to do so, and the effect on your site and business if your data is scraped.
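
The throttling suggested above (a mixture of login account and request IP) could look roughly like the following. It is a sketch under assumptions: the sliding-window approach, the `SlidingWindowThrottle` name, and the in-memory storage are illustrative choices, and a real deployment would likely need shared storage and cleanup of idle keys.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behavior can be tested
        self._hits = defaultdict(deque)

    def allow(self, account, ip):
        """Return True if the request is within both the IP and account limits."""
        now = self.clock()
        # Track the IP and the account separately, so neither many accounts
        # behind one IP nor one account spread over many IPs escapes the limit.
        keys = [("ip", ip)]
        if account is not None:
            keys.append(("account", account))

        allowed = True
        for key in keys:
            hits = self._hits[key]
            # Drop timestamps that have fallen out of the window.
            while hits and now - hits[0] >= self.window:
                hits.popleft()
            if len(hits) >= self.limit:
                allowed = False

        if allowed:
            for key in keys:
                self._hits[key].append(now)
        return allowed
```

Rejected requests are not counted here, so a client that keeps hammering is let back in once its earlier requests age out of the window; counting rejections too would be a stricter policy.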
csense, over 12 years ago
If some of the data you care about is user-generated, you might want to try the GitHub model: People can get a free account, but all the information they generate on the site will be public. Keeping your information private requires a paying subscription.
moocow01, over 12 years ago
Use Flash or render your content as images. Neither is 100% locked down, but it's going to give anyone writing a scraper a run for their money.

Outside of preventing scraping, both of these ideas are likely to be seen as stupid.
ressaid1, over 12 years ago
Check out services like www.distil.it or blockscraping.com
taligent, over 12 years ago
There is nothing you can do to prevent scraping, especially with tools like PhantomJS, which run on a real browser engine (WebKit).

The ONLY way is, as suggested, to throttle based on IP address and X-Forwarded-For.
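
Throttling on X-Forwarded-For needs some care, because the header is supplied by the client and can be forged. A common approach, sketched here under assumptions, is to trust only the entries appended by your own proxies. The `client_ip` function and the `trusted_proxies` parameter are hypothetical names for illustration.

```python
import ipaddress


def client_ip(remote_addr, x_forwarded_for, trusted_proxies):
    """Pick the address to throttle on.

    Walks the hop chain from the nearest peer backwards and returns the
    first address that is not one of our own proxies. Entries further
    left are client-supplied and are ignored.
    """
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(addr):
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            return False  # malformed entries are never trusted
        return any(ip in net for net in trusted)

    # X-Forwarded-For is a comma-separated list; each proxy appends the
    # address it received the request from.
    chain = [p.strip() for p in (x_forwarded_for or "").split(",") if p.strip()]
    chain.append(remote_addr)

    for addr in reversed(chain):
        if not is_trusted(addr):
            return addr
    # Every hop was a trusted proxy; fall back to the leftmost entry.
    return chain[0]
```

For example, with `trusted_proxies=["10.0.0.0/8"]`, a request arriving directly from an untrusted peer is keyed on that peer's address no matter what the header claims.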