Hey guys, I'm curious to hear what you think about page scraping. Are you happy to scrape other sites? Are you happy to have yours scraped? What's your view on the ethics?<p>I'm asking because I've noticed my site has started to be scraped (by more than just search engines), and I'm trying to figure out how I feel about it. In this case I'm happy because I know the audience/user base is smallish. If an app that did it went mainstream, I wouldn't be happy.<p>Of course the pragmatic answer is simple: 'build an API'. A few more weekends and I might.<p>But right now I'm interested to hear people's opinions on it.
I think that if you're generally respectful of the target websites, scraping them is OK. For example, I scrape various government websites for my website. I use a random delay between requests and am generally very careful about not requesting the same page multiple times (this is hard because a lot of the pagination on these pages happens via JS calls).<p>I'm OK with someone scraping my websites in a similar fashion, although if I see that starting to happen, I'd rather just go ahead and build an API.
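For anyone wondering what "random delay plus never requesting the same page twice" might look like in practice, here is a rough sketch in Python. It is not the code behind the comment above; the `PoliteScraper` class, the `fetch_url` helper, and the 2 to 6 second delay window are all made up for illustration, and it does nothing about the JS pagination problem.

```python
import random
import time
import urllib.request


def fetch_url(url):
    """Download one page with the standard library (no retries, no error handling)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


class PoliteScraper:
    """Hypothetical helper: waits a random interval between requests
    and remembers every page it has already fetched."""

    def __init__(self, fetch=fetch_url, min_delay=2.0, max_delay=6.0, sleep=time.sleep):
        self.fetch = fetch          # callable(url) -> body; swappable for testing
        self.min_delay = min_delay  # assumed lower bound of the pause, in seconds
        self.max_delay = max_delay  # assumed upper bound of the pause, in seconds
        self.sleep = sleep
        self.cache = {}             # url -> body, so each page is requested at most once
        self.requests_made = 0

    def get(self, url):
        # Serve repeats from the cache instead of hitting the site again.
        if url in self.cache:
            return self.cache[url]
        # Pause for a random interval before every request except the first.
        if self.requests_made > 0:
            self.sleep(random.uniform(self.min_delay, self.max_delay))
        body = self.fetch(url)
        self.requests_made += 1
        self.cache[url] = body
        return body
```

Usage would be something like `scraper = PoliteScraper()` followed by `scraper.get(url)` for each page. The cache here lives in memory only, so a long-running scraper would probably want to persist it between runs.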
On the consumer side, I'm happy with the following rules.<p>I'm happy writing a program that lets individual users scrape from their own computers. After all, they have a right to visit the site and retrieve their data in whatever format suits them.<p>I'm not so keen on setting up a server to scrape data, or having a server scrape a huge pile of data on behalf of a list of users. After all, whoever runs the service is keeping that data for all of its users, and my taking all of it is just stealing.<p>On the provider side, my feelings are about the same. I think you have to be careful about how you leverage scraping: let scrapers come in and get enough that it makes people want to visit, but not so much that they have everything. Done well, scraping can work to your benefit.
I guess it depends on what they're doing with it. I'm not particularly against scraping per se, but I would look askance at some of the more sleazy uses, like just republishing (slightly modified versions of) blog posts on some AdSense-laden blog as if it were their own post. The key issues to me are: 1) transformativity, i.e. it produces something genuinely new and different from the content it scraped; and 2) proper credit to the source of the original content.
I'm happy to page scrape other sites and not happy to have mine scraped. ;-)<p>More seriously, if there are bots you don't want scraping your site, just robots.txt them away. If they ignore that, <i>then</i> they're being rather rude and you can figure out some way to auto-block them.
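To make the robots.txt suggestion concrete, here is a small sketch using Python's standard `urllib.robotparser`. The bot name `ExampleScraperBot`, the `/private/` path, and the `example.test` URLs are invented for the example. The `ROBOTS_TXT` text is what a site owner might publish; the two functions show how a well-behaved scraper could check it before fetching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named bot entirely,
# and keep everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleScraperBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)
```

A scraper would call `may_fetch(parser, "MyBot", url)` before each request and skip anything that comes back `False`. Nothing enforces this on the server side, which is why bots that ignore it need blocking by other means.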