"Archive Team interprets ROBOTS.TXT as damage and temporary madness, and works around it. Everyone should. If you don't want people to have your data, don't put it online."
This is composed of equal parts insight and daftness, though not entirely for the right reasons.

The daftness: maybe the claim is true that robots.txt was only a stop-gap measure back when web servers sucked, but its *de facto* modern use goes far beyond that, and ignoring the standard is likely to piss off a lot of people.

The insight: for crawlers, relying on robots.txt to avoid getting stuck indexing infinite hierarchies of data is a bad idea. A crawler should be able to figure that much out for itself, so it doesn't explode when faced with sites that don't exclude such hierarchies via robots.txt.

For servers, relying on a client hint to ensure reliability is daft. A server should have some form of rate limiting built in, as that's the only sensible design. That seems like the only marginally sensible use of robots.txt from a server standpoint. Using it for any form of security (e.g. preventing DB scraping) is daft too, and a more robust mechanism should be employed there as well.
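For the crawler half of that argument, here is a minimal sketch (Python, with made-up limits) of the kind of self-protection meant: cap how deep and how many pages per host you will fetch, so an infinite hierarchy can't trap the crawler even when robots.txt says nothing about it.

    from collections import defaultdict
    from urllib.parse import urlparse

    # Hypothetical per-host limits; a real crawler would tune these per site.
    MAX_DEPTH = 8                # maximum number of path segments to follow
    MAX_PAGES_PER_HOST = 10_000  # per-host page budget

    pages_seen = defaultdict(int)

    def should_fetch(url):
        """Return True if the crawler should fetch this URL at all."""
        parsed = urlparse(url)
        depth = len([seg for seg in parsed.path.split("/") if seg])
        if depth > MAX_DEPTH:
            return False          # probably an infinite hierarchy
        if pages_seen[parsed.netloc] >= MAX_PAGES_PER_HOST:
            return False          # per-host budget exhausted
        pages_seen[parsed.netloc] += 1
        return True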
The great things about 'robots.txt' are (1) it's the simplest thing that could possibly work; and (2) the default assumption in the absence of webmaster effort is 'allow'.

(2) is immensely valuable. Without it, search engines and the largest archive of web content, the Internet Archive (where I work on web archiving), could not exist at their current scales, as a practical matter.

There's a place for ArchiveTeam's style of in-your-face, adversarial archiving... but if it were the dominant approach, the backlash from publishers and the law could result in prevailing conventions that are much worse than robots.txt, such as a default-deny/always-ask-permission-first regime. Search and archiving activities would have to be surreptitious, or limited to those with much deeper pockets for obscuring their actions, requesting/buying permission, or legal defenses.

So, Jason, be careful what you wish for.
I disagree with this post almost as strongly as I agree with it.

Robots.txt *is* a suicide note. It's utter short-sighted hubris to say "this is MY information and I don't want you spidering it". Are you volunteering to maintain that information forever? Are you promising to never go out of business? Never be ordered to remove it by the government? Never be bought out by Oracle?

Right now there seems to be a lot of confusion over the morality of information. People are possessed by the strange idea that you, mister content provider, own that content and have an inalienable right to control it any way you can get away with. But someday you will die, and your company will die, just like Geocities, Google Video, and the Library of Alexandria. Society should have a right to keep that information after you're gone.

Of course, the law disagrees. And without the efforts of criminals like geohot, the iPhone DevTeam, The Nomad, Muslix64 and, yes, The Archive Team, people of the future will have no way to access the information we've locked up through our own paranoia. You don't have to cast your mind a thousand years into the future - it's happening right now. Vast swathes of data are disappearing as DRM servers go dark only a few years after they appear (thanks, MSN Music, Yahoo Music Store).

I believe that we owe it to our descendants to give them access to their history. I believe it's not our decision whether the things we make are too valuable or too uncomfortable to be preserved. And I believe that robots.txt is a suicide note, a product of the diseased minds that think our short-term desire for control outweighs our legacy.

But I don't know what the fuck the article's talking about. It seems to be making a bunch of points that don't matter. Use robots.txt to prevent technical problems if you like, I don't care. Just don't use it to stop people from crawling your content, or you're shitting on the future.
The rationale is weak. Some data is simply not worth indexing, and not worth serving up to bots. The flipside is: your crawler doesn't need to fetch everything on my site, and I'd be happy to ban all non-conforming bots site-wide.

It's not *just* about the functionality; it's also a show of good faith and basic respect. If you're a bot author who knowingly violates my site policy, I'd rather you didn't communicate with my web server at all.

robots.txt isn't perfect. Ideally a web server would be configured to deny bots access to restricted content via some sort of DNSBL mechanism (or a CPAN/whatever module). Or do both, and ban the non-conforming bots site-wide.

The above notwithstanding, I'm voting for this article. It doesn't betray the usual cowardice of hiding the assertion behind the presumptuous *Why*.
This may be a dumb move from a legal perspective. Court cases have suggested that robots.txt files may count as technological measures in DMCA cases[1]. Granted, that's far from guaranteed. But I certainly wouldn't want to be the one to go to court over it.

[1] http://www.groklaw.net/article.php?story=20070819090725314&query=robots.txt
This is a childish argument, based on the attitude of "don't use robots.txt because it interferes with what we do, and what we do is aw3s0m3 l337". That attitude also prevails in the archiveteam's comments here. I doubt their actions can be taken seriously.

I wonder how this made it to 88 votes here on HN.
The Archiveteam.org favicon is a hand making a rude gesture. I think that sums up many people's opinion of this story.

It certainly is an indicator of how seriously you should take this organisation.
Their attitude can be summed up as "it's on the internet, it's ours to take". OK, oversimplified, but that's the essence, no?

So, dear archiveteam, please remember that when I put a server on the internet, it's a voluntary and public service, and putting 'Disallow:' lines in the robots.txt means that I've set some rules. It's just rude to ignore those rules, whatever your motivations are.

You have no right to access my content, just as you have no right to walk into my house. If I invite you in, please behave.
Sorry, but this is terrible advice. Yes, you should make sure your site won't break if it's slammed by a large crawl. Yes, you should hide destructive actions behind POSTs, not GETs. But robots.txt is insanely useful. If I didn't have a robots.txt file, Google/Bing/Yahoo would index countless repetitive, unimportant files and my site would suffer in search engine ranking. In our case, we host GPX/KML files and textual cuesheets for driving and biking. If that stuff is indexed, our site's relevant keywords become "left", "right" and GPS timestamp fragments like "0z01".

So use it wisely, but don't abandon it altogether.
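A hedged sketch of the kind of rules involved; the /gpx/, /kml/ and /cuesheets/ paths are hypothetical, not this commenter's actual layout:

    # Keep raw track files and cuesheets out of the index,
    # but leave the rest of the site crawlable.
    User-agent: *
    Disallow: /gpx/
    Disallow: /kml/
    Disallow: /cuesheets/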
robots.txt is simple and effective.

I do not want certain bots, especially so-called "archives", automatically downloading all my content. That's what robots.txt is for, and it works well.

The article is just stupid, sorry. There isn't one real, knowledgeable argument in it.
Wow. What arrogance.

A good reason to honeypot, if you aren't already.

It's expected. It's polite. Respect the site owner's published policy or expect to get IP-banned like any other script kiddie, because when a site admin sees you ripping content, he isn't thinking "yay! Archive Team is here to do a free backup!"; he thinks you are stealing his shit.
I love the tone of this article, I was smiling the whole time. Especially here:

*the onslaught of some social media hoo-hah*

edit: just clicked through a few pages; whoever does the writing at Archiveteam is fantastic!
BTW, robots.txt disables access to versions of pages already archived on the Wayback Machine. I encountered this when looking for old technotes on developer.apple.com.
I feel like creating a honeypot for bad bots now: put an exclude line in robots.txt, include that URL in my pages, and when a bot hits it anyway, ban the IP.
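A rough sketch of that trap, assuming a hypothetical /trap/ URL (listed under Disallow in robots.txt and linked invisibly from pages) and a common-log-format access log; the ban commands are only printed for review, not executed:

    import re

    TRAP_PATH = "/trap/"      # also listed as "Disallow: /trap/" in robots.txt
    LOG_FILE = "access.log"   # hypothetical common-log-format access log

    banned = set()
    with open(LOG_FILE) as log:
        for line in log:
            # First field of the common log format is the client IP.
            m = re.match(r'(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)', line)
            if m and m.group(2).startswith(TRAP_PATH):
                banned.add(m.group(1))

    for ip in sorted(banned):
        # Print the ban commands for review instead of running them blindly.
        print(f"iptables -A INPUT -s {ip} -j DROP")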
No, it is about not being willing to waste bandwidth and server capacity on unworthy projects (no one will ever find my site through Baidu, yet it still gets indexed).

Google and archive.org are another matter; I'm happy to support them.
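For what it's worth, that policy is a short robots.txt (Baiduspider is Baidu's crawler token; treat this as a sketch rather than a complete policy):

    # Turn away Baidu's crawler...
    User-agent: Baiduspider
    Disallow: /

    # ...but leave everything open to everyone else (Google, archive.org, etc.).
    User-agent: *
    Disallow: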
So a little while ago we had a story which was essentially: "promote your startup / website by causing maximum outrage! Outrage is good! YAY PISSING PEOPLE OFF!"

Now we get a non-story which is essentially designed to piss off the people of HN. Looking forward to more of the same, given that it works.
Well, sure, robots.txt is not the best solution, but it works, and it helps a lot when msnbot or yandexbot accounts for more than half of the requests to your MediaWiki (diffs between revisions), your gitweb (commitdiffs) or your phpBB installation and kills performance.

Tired of having our machine killed by those bots, we use robots.txt.

Sure, there are other solutions (proper blocking), but this one works perfectly well and avoids having to modify the third-party applications we run for an open-source development team.
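A sketch of what such a robots.txt might look like; the mount points (/wiki/, /git/, /forum/) and the delay value are hypothetical, and Crawl-delay is a non-standard extension that msnbot and Yandex honour but Google ignores:

    User-agent: *
    # dynamic MediaWiki views (diffs, histories) go through index.php
    Disallow: /wiki/index.php?
    # gitweb commitdiff and blame pages
    Disallow: /git/
    # phpBB search pages
    Disallow: /forum/search.php
    Crawl-delay: 10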
The only thing I have ever used robots.txt for is to stop leaking PageRank. I have a folder called /redirect/ and I exclude that folder in my robots.txt. I then link to external sites like this: /redirect/?l=www.mysite.com

Anything I don't want archived, I put behind a login wall.
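The corresponding rule is a one-liner; whether excluded links actually stop passing PageRank is up to the search engines rather than the protocol, so treat this as the commenter's recipe, not a guarantee:

    User-agent: *
    Disallow: /redirect/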