
Ask HN: Using mTurk to morally/legally get around a robots.txt disallow?

1 point by MechanicalTwerk over 11 years ago
A site offers any visitor (authenticated or not) free download of documents at a certain path. This path is disallowed from being crawled by all user agents in robots.txt. What is the consensus around using something like Mechanical Turk to distribute the process of physically clicking the free download link and collecting the documents? Would this fall into the "avoiding a technological control" category? I know, I know, I should ask a lawyer, but I'm interested in the community's opinion on the practice.
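For context, a blanket disallow of the kind described would look something like this in the site's robots.txt (the /documents/ path here is hypothetical, standing in for the download path in question):

    # Hypothetical robots.txt: ask every crawler to skip the download path
    User-agent: *
    Disallow: /documents/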

2 comments

byoung2 over 11 years ago
If you have to ask, you probably already know the answer. A more important question is what you plan to do with the files, and whether that use is allowed by the terms. If it is, you could simply ask the site to grant you access to download them. It is also possible that they disallow crawling just to reduce load on their servers, so that crawlers don't waste time on text files when there is more valuable content to crawl elsewhere on the site.
icedchai over 11 years ago
robots.txt is not legally enforceable.
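To illustrate icedchai's point: robots.txt is a voluntary convention, not an access control. A minimal sketch using Python's standard urllib.robotparser (the example.com host and path are placeholders) shows that compliance is something the client opts into:

    # Minimal sketch: robots.txt is advisory -- a client has to
    # choose to consult it before fetching anything.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # placeholder host
    rp.read()

    # This only reports what the file requests; nothing technically
    # stops a client that skips this check from fetching the URL anyway.
    print(rp.can_fetch("*", "https://example.com/documents/file1.pdf"))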