Robots.txt Disallow: 20 Years of Mistakes To Avoid

106 points by hornokplease almost 11 years ago

11 comments

Asparagirl almost 11 years ago

This article forgot the *very* worst use of robots.txt:

    User-agent: ia_archiver
    Disallow: /

Those two lines mean that all content hosted on the entire site will be blocked from the Internet Archive (archive.org) Wayback Machine, and the public will be unable to look at any previous versions of the website's content. It wipes out a public view of the past.

Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt

Banning access to history like that is shameful.
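A quick way to check whether a given site blocks the Wayback Machine's crawler is Python's standard-library urllib.robotparser (my illustration, not part of the comment; the URL is just the example called out above):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's live robots.txt.
    rp = RobotFileParser("http://www.washingtonpost.com/robots.txt")
    rp.read()

    # Prints False if the Wayback Machine's crawler is disallowed.
    print(rp.can_fetch("ia_archiver", "http://www.washingtonpost.com/"))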
TheLoneWolfling almost 11 years ago

What frustrates me is the number of websites that impose additional restrictions on anything they don't recognize - or worse, impose additional restrictions on (or worse yet, just outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking off.
dredge almost 11 years ago

The article contains some good observations, but I'm struggling to understand this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...

    # What's all this then?
    #    \
    #
    #     -----
    #    | . . |
    #     -----
    #    \--|-|--/
    #       | |
    #     |-------|

...a "mistake" to avoid? There's no harm in it at all.
freddielarge almost 11 years ago

Fun fact: robots.txt can also be used by attackers to find admin interfaces or other sensitive tidbits that you don't want search engines to crawl.

Lots of target-detection crawlers will look at robots.txt as the very first thing they do, to see if there are any fun pages you don't want the other crawlers to see.
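A minimal sketch of what such a scanner does (my illustration; the host is a placeholder): fetch robots.txt and print the Disallow paths, since those often point straight at the pages the operator wanted hidden.

    import urllib.request

    # Fetch the target's robots.txt (placeholder host).
    with urllib.request.urlopen("https://www.example.com/robots.txt") as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()

    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                print(path)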
spaulo12 almost 11 years ago

In the past I've created an empty robots.txt just to keep the 404 errors out of my logs...
sp332 almost 11 years ago
Why does Google ignore the crawl delay?
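(For reference, this is the non-standard directive in question - some crawlers honour it, but Google ignores it in favour of its own crawl-rate setting in Webmaster Tools. The value is made up for illustration:)

    User-agent: *
    Crawl-delay: 10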
pipihu almost 11 years ago

The main use for robots.txt is to prevent crawling of infinite URL spaces: http://googlewebmastercentral.blogspot.com.br/2008/08/to-infinity-and-beyond-no.html

Alongside tagging links to such resources with nofollow.
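For instance, a hypothetical rule set for a site whose calendar and faceted search generate endless URLs (the paths are made up for illustration):

    User-agent: *
    Disallow: /calendar/
    Disallow: /search?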
sbierwagen almost 11 years ago

My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
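A minimal sketch of that setup, assuming a Python/Flask app (the commenter doesn't say what their server actually runs):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/robots.txt")
    def robots():
        # 410 Gone: tell crawlers the resource is deliberately absent.
        return "", 410

    if __name__ == "__main__":
        app.run()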
franze almost 11 years ago

yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i could really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

the article is pretty much correct (although strangely worded at times), but the stuff about "communicating via robots.txt comments to google" is of course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in robots.txt. why? you can group user agents, i.e.:

    User-agent: Googlebot
    User-agent: bingbot
    User-agent: Yandex
    Disallow: /

congrats, you have just disallowed Googlebot, bingbot and Yandex from crawling (not indexing, just crawling). ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-agent: Yandex
    Disallow: /

so you have definitely blocked Yandex, and you don't care about bingbot (commented out), but what about Googlebot? are Googlebot and Yandex part of one user-agent group? or is Googlebot its own group and Yandex its own group? if the commented line is interpreted as a blank line, then Googlebot and Yandex are different groups; if it's interpreted as nonexistent, they belong together. the way i read the spec, this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt - it has this undefined behaviour, it doesn't follow HTTP 301 redirects (which were unspecified when i coded it), and it tries to do too much (fetching and analysing, when it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml

and return HTTP 200.

why? if you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some misleading response. also, it's quite common for a staging robots.txt to spill over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.

also read the spec: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
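As a side note, one can see how one real parser resolves the ambiguous commented-out case above using Python's standard-library urllib.robotparser (my illustration; it says nothing about how Googlebot or the npm package behave):

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: Googlebot",
        "#User-agent: bingbot",
        "User-agent: Yandex",
        "Disallow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # CPython's parser drops the comment line entirely, so Googlebot and
    # Yandex end up in the same group and both are blocked, while bingbot
    # (never mentioned in a surviving line) is allowed. Other parsers may
    # resolve this differently - which is exactly the point above.
    for bot in ("Googlebot", "bingbot", "Yandex"):
        print(bot, rp.can_fetch(bot, "http://www.example.com/page"))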
blueskin_ almost 11 years ago
There are enough malicious bots that do follow robots.txt to make it still an important option for most sites.
Istof almost 11 years ago

500 KB limit? you call that short and sweet?