"Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online."
This is composed of equal parts insight and daftness, though not entirely for the right reasons.

The daftness: maybe the claim is true that robots.txt was only a stop-gap measure back when web servers sucked, but its *de facto* modern use goes far beyond that, and ignoring the standard is likely to piss off a lot of people.

The insight: for crawlers, relying on robots.txt to avoid getting stuck indexing infinite hierarchies of data is a bad idea. A crawler should be able to figure that much out for itself, so it doesn't explode when faced with sites that don't exclude such hierarchies via robots.txt.

For servers, relying on a client hint to ensure reliability is daft. A server should have some form of rate limiting built in, as that's the only sensible design. That seems like the only marginally sensible use of robots.txt from a server standpoint. Using it for any form of security (e.g. preventing DB scraping) is daft too, and a more robust mechanism should be employed there as well.
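For the crawler half of that argument, here is a minimal sketch (Python, with made-up limits) of the kind of self-protection meant: cap how deep and how many pages per host you will fetch, so an infinite hierarchy can't trap the crawler even when robots.txt says nothing about it.

    from collections import defaultdict
    from urllib.parse import urlparse

    # Hypothetical per-host limits; a real crawler would tune these per site.
    MAX_DEPTH = 8                # maximum number of path segments to follow
    MAX_PAGES_PER_HOST = 10_000  # per-host page budget

    pages_seen = defaultdict(int)

    def should_fetch(url):
        """Return True if the crawler should fetch this URL at all."""
        parsed = urlparse(url)
        depth = len([seg for seg in parsed.path.split("/") if seg])
        if depth > MAX_DEPTH:
            return False          # probably an infinite hierarchy
        if pages_seen[parsed.netloc] >= MAX_PAGES_PER_HOST:
            return False          # per-host budget exhausted
        pages_seen[parsed.netloc] += 1
        return True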
The great things about 'robots.txt' are (1) it's the simplest thing that could possibly work; and (2) the default assumption in the absence of webmaster effort is 'allow'.

(2) is immensely valuable. Without it, search engines and the largest archive of web content, the Internet Archive (where I work on web archiving), could not exist at their current scales, as a practical matter.

There's a place for ArchiveTeam's style of in-your-face, adversarial archiving... but if it were the dominant approach, the backlash from publishers and the law could result in prevailing conventions that are much worse than robots.txt, such as a default-deny/always-ask-permission-first regime. Search and archiving activities would have to be surreptitious, or limited to those with much deeper pockets for obscuring their actions, requesting/buying permission, or legal defenses.

So, Jason, be careful what you wish for.
I disagree with this post almost as strongly as I agree with it.

Robots.txt *is* a suicide note. It's utter short-sighted hubris to say "this is MY information and I don't want you spidering it". Are you volunteering to maintain that information forever? Are you promising to never go out of business? Never be ordered to remove it by the government? Never be bought out by Oracle?

Right now there seems to be a lot of confusion over the morality of information. People are possessed by the strange idea that you, mister content provider, own that content and have an inalienable right to control it any way you can get away with. But someday you will die, and your company will die, just like Geocities, Google Video, and the Library of Alexandria. Society should have a right to keep that information after you're gone.

Of course, the law disagrees. And without the efforts of criminals like geohot, the iPhone DevTeam, The Nomad, Muslix64 and, yes, The Archive Team, people of the future will have no way to access the information we've locked up through our own paranoia. You don't have to cast your mind a thousand years into the future - it's happening right now. Vast swathes of data are disappearing as DRM servers go dark only a few years after they appear (thanks, MSN Music, Yahoo Music Store).

I believe that we owe it to our descendants to give them access to their history. I believe it's not our decision whether the things we make are too valuable or too uncomfortable to be preserved. And I believe that robots.txt is a suicide note, a product of the diseased minds that think our short-term desire for control outweighs our legacy.

But I don't know what the fuck the article's talking about. It seems to be making a bunch of points that don't matter. Use robots.txt to prevent technical problems if you like, I don't care. Just don't use it to stop people from crawling your content, or you're shitting on the future.
The rationale is weak. Some data is simply not worth indexing, and not worth serving up to bots. The flipside is: your crawler doesn't need to fetch everything on my site, and I'd be happy to ban all non-conforming bots site-wide.

It's not *just* about the functionality; it's also a show of good faith and basic respect. If you're a bot author who knowingly violates my site policy, I'd rather you didn't communicate with my web server at all.

robots.txt isn't perfect. Ideally a web server would be configured to deny bots access to restricted content via some sort of DNSBL mechanism (or a CPAN/whatever module). Or do both, and ban the non-conforming bots site-wide.

The above notwithstanding, I'm voting for this article. It doesn't betray the usual cowardice of hiding the assertion behind the presumptuous *Why*.
This may be a dumb move from a legal perspective. Court cases have suggested that robots.txt files may count as technological measures in DMCA cases[1]. Granted, that's far from guaranteed. But I certainly wouldn't want to be the one to go to court over it.

[1] http://www.groklaw.net/article.php?story=20070819090725314&query=robots.txt
This is a childish argument, based on the attitude of "don't use robots.txt because it interferes with what we do, and what we do is aw3s0m3 l337". That attitude also prevails in the archiveteam's comments here. I doubt their actions can be taken seriously.

I wonder how this made it to 88 votes here on HN.
The Archiveteam.org favicon is a hand making a rude gesture. I think that sums up many people's opinion of this story.

It certainly is an indicator of how seriously you should take this organisation.
Their attitude can be summed up as "it's on the internet, it's ours to take". OK, oversimplified, but that's the essence, no?

So, dear archiveteam, please remember that when I put a server on the internet, it's a voluntary and public service, and putting 'Disallow:' lines in the robots.txt means that I've set some rules. It's just rude to ignore those rules, whatever your motivations are.

You have no right to access my content, just as you have no right to walk into my house. If I invite you in, please behave.
Sorry, but this is terrible advice. Yes, you should make sure your site won't break if it's slammed by a large crawl. Yes, you should hide destructive actions behind POSTs, not GETs. But robots.txt is insanely useful. If I didn't have a robots.txt file, Google/Bing/Yahoo would index countless repetitive, unimportant files and my site would suffer in search engine ranking. In our case, we host GPX/KML files and textual cuesheets for driving and biking. If that stuff is indexed, our site's relevant keywords become "left", "right" and GPS timestamp fragments like "0z01".

So use it wisely, but don't abandon it altogether.
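A hedged sketch of the kind of rules involved; the /gpx/, /kml/ and /cuesheets/ paths are hypothetical, not this commenter's actual layout:

    # Keep raw track files and cuesheets out of the index,
    # but leave the rest of the site crawlable.
    User-agent: *
    Disallow: /gpx/
    Disallow: /kml/
    Disallow: /cuesheets/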
robots.txt is simple and effective.

I do not want certain bots, especially so-called "archives", automatically downloading all my content. That's what robots.txt is for, and it works well.

The article is just stupid, sorry. There isn't one real, knowledgeable argument in it.
Wow. What arrogance.

A good reason to honeypot, if you aren't already.

It's expected. It's polite. Respect the site owner's published policy or expect to get IP-banned like any other script kiddie, because when a site admin sees you ripping content, he isn't thinking "yay! Archive Team is here to do a free backup!"; he thinks you are stealing his shit.
I love the tone of this article, I was smiling the whole time. Especially here:

*the onslaught of some social media hoo-hah*

edit: just clicked through a few pages; whoever does the writing at Archiveteam is fantastic!
BTW, robots.txt disables access to versions of pages already archived on the Wayback Machine. I encountered this when looking for old technotes on developer.apple.com.
I feel like creating a honeypot for bad bots now: put an exclude line in robots.txt, include that URL in my pages, and when a bot hits it anyway, ban the IP.
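A rough sketch of that trap, assuming a hypothetical /trap/ URL (listed under Disallow in robots.txt and linked invisibly from pages) and a common-log-format access log; the ban commands are only printed for review, not executed:

    import re

    TRAP_PATH = "/trap/"      # also listed as "Disallow: /trap/" in robots.txt
    LOG_FILE = "access.log"   # hypothetical common-log-format access log

    banned = set()
    with open(LOG_FILE) as log:
        for line in log:
            # First field of the common log format is the client IP.
            m = re.match(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)', line)
            if m and m.group(2).startswith(TRAP_PATH):
                banned.add(m.group(1))

    for ip in sorted(banned):
        # Print the ban commands for review instead of running them blindly.
        print(f"iptables -A INPUT -s {ip} -j DROP")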
No, it is about not being willing to waste bandwidth and server capacity on unworthy projects (no one will ever find my site through Baidu, yet it still gets indexed).

Google and archive.org are another matter; I'm happy to support them.
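For what it's worth, that policy is a short robots.txt (Baiduspider is Baidu's crawler token; treat this as a sketch rather than a complete policy):

    # Turn away Baidu's crawler...
    User-agent: Baiduspider
    Disallow: /

    # ...but leave everything open to everyone else (Google, archive.org, etc.).
    User-agent: *
    Disallow: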
So a little while ago we had a story which was essentially: "promote your startup / website by causing maximum outrage! Outrage is good! YAY PISSING PEOPLE OFF!"

Now we get a non-story which is essentially designed to piss off the people of HN. Looking forward to more of the same, given that it works.
Well, sure, robots.txt is not the best solution, but it works, and it helps a lot when msnbot or yandexbot accounts for more than half of the requests to your MediaWiki (diffs between revisions), your gitweb (commitdiffs) or your phpBB installation and kills performance.

Tired of having our machine killed by those bots, we use robots.txt.

Sure, there are other solutions (proper blocking), but this one works perfectly well and avoids having to modify the third-party applications we run for an open-source development team.
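A sketch of what such a robots.txt might look like; the mount points (/wiki/, /git/, /forum/) and the delay value are hypothetical, and Crawl-delay is a non-standard extension that msnbot and Yandex honour but Google ignores:

    User-agent: *
    # dynamic MediaWiki views (diffs, histories) go through index.php
    Disallow: /wiki/index.php?
    # gitweb commitdiff and blame pages
    Disallow: /git/
    # phpBB search pages
    Disallow: /forum/search.php
    Crawl-delay: 10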
The only thing I have ever used robots.txt for is to stop leaking PageRank. I have a folder called /redirect/ and I exclude that folder in my robots.txt. I then link to external sites like this: /redirect/?l=www.mysite.com

Anything I don't want archived, I put behind a login wall.
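The corresponding rule is a one-liner; whether excluded links actually stop passing PageRank is up to the search engines rather than the protocol, so treat this as the commenter's recipe, not a guarantee:

    User-agent: *
    Disallow: /redirect/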