Robots.txt Disallow: 20 Years of Mistakes To Avoid

106 points by hornokplease almost 11 years ago

11 comments

Asparagirl almost 11 years ago

This article forgot the *very* worst use of robots.txt:

    User-agent: ia_archiver
    Disallow: /

Those two lines mean that all content hosted on the entire site will be blocked from the Internet Archive (archive.org) Wayback Machine, and the public will be unable to look at any previous versions of the website's content. It wipes out a public view of the past.

Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt

Banning access to history like that is shameful.
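A quick way to check whether a given site blocks the Wayback Machine's crawler is Python's standard-library urllib.robotparser (my illustration, not part of the comment; the URL is just the example called out above):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's live robots.txt.
    rp = RobotFileParser("http://www.washingtonpost.com/robots.txt")
    rp.read()

    # Prints False if the Wayback Machine's crawler is disallowed.
    print(rp.can_fetch("ia_archiver", "http://www.washingtonpost.com/"))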
TheLoneWolfling almost 11 years ago

What frustrates me is the number of websites that impose additional restrictions on anything they don't recognize - or worse, impose additional restrictions on (or worse yet, just outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking off.
dredge almost 11 years ago

The article contains some good observations, but I'm struggling to understand this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...

    # What's all this then?
    #    \
    #
    #     -----
    #    | . . |
    #     -----
    #    \--|-|--/
    #       | |
    #     |-------|

...a "mistake" to avoid? There's no harm in it at all.
freddielarge almost 11 years ago

Fun fact: robots.txt can also be used by attackers to find admin interfaces or other sensitive tidbits that you don't want search engines to crawl.

Lots of target-detection crawlers will look at robots.txt as the very first thing they do, to see if there are any fun pages you don't want the other crawlers to see.
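A minimal sketch of what such a scanner does (my illustration; the host is a placeholder): fetch robots.txt and print the Disallow paths, since those often point straight at the pages the operator wanted hidden.

    import urllib.request

    # Fetch the target's robots.txt (placeholder host).
    with urllib.request.urlopen("https://www.example.com/robots.txt") as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()

    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow means "allow everything"
                print(path)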
spaulo12 almost 11 years ago

In the past I've created an empty robots.txt just to keep the 404 errors out of my logs...
sp332 almost 11 years ago
Why does Google ignore the crawl delay?
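(For reference, this is the non-standard directive in question - some crawlers honour it, but Google ignores it in favour of its own crawl-rate setting in Webmaster Tools. The value is made up for illustration:)

    User-agent: *
    Crawl-delay: 10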
pipihu almost 11 years ago

The main use for robots.txt is to prevent crawling of infinite URL spaces: http://googlewebmastercentral.blogspot.com.br/2008/08/to-infinity-and-beyond-no.html

Alongside tagging links to such resources with nofollow.
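For instance, a hypothetical rule set for a site whose calendar and faceted search generate endless URLs (the paths are made up for illustration):

    User-agent: *
    Disallow: /calendar/
    Disallow: /search?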
sbierwagen almost 11 years ago

My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
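A minimal sketch of that setup, assuming a Python/Flask app (the commenter doesn't say what their server actually runs):

    from flask import Flask

    app = Flask(__name__)

    @app.route("/robots.txt")
    def robots():
        # 410 Gone: tell crawlers the resource is deliberately absent.
        return "", 410

    if __name__ == "__main__":
        app.run()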
franze almost 11 years ago

yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i could really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

the article is pretty much correct (although strangely worded at times), but the stuff about "communicating via robots.txt comments to google" is of course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in robots.txt. why? you can group user agents, i.e.:

    User-agent: Googlebot
    User-agent: bingbot
    User-agent: Yandex
    Disallow: /

congrats, you have just disallowed Googlebot, bingbot and Yandex from crawling (not indexing, just crawling). ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-agent: Yandex
    Disallow: /

so you have definitely blocked Yandex, and you don't care about bingbot (commented out), but what about Googlebot? are Googlebot and Yandex part of one user-agent group? or is Googlebot its own group and Yandex its own group? if the commented line is interpreted as a blank line, then Googlebot and Yandex are different groups; if it's interpreted as nonexistent, they belong together. the way i read the spec, this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt - it has this undefined behaviour, it doesn't follow HTTP 301 redirects (which were unspecified when i coded it), and it tries to do too much (fetching and analysing, when it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml

and return HTTP 200.

why? if you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some misleading response. also, it's quite common for a staging robots.txt to spill over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.

also read the spec: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
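As a side note, one can see how one real parser resolves the ambiguous commented-out case above using Python's standard-library urllib.robotparser (my illustration; it says nothing about how Googlebot or the npm package behave):

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: Googlebot",
        "#User-agent: bingbot",
        "User-agent: Yandex",
        "Disallow: /",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # CPython's parser drops the comment line entirely, so Googlebot and
    # Yandex end up in the same group and both are blocked, while bingbot
    # (never mentioned in a surviving line) is allowed. Other parsers may
    # resolve this differently - which is exactly the point above.
    for bot in ("Googlebot", "bingbot", "Yandex"):
        print(bot, rp.can_fetch(bot, "http://www.example.com/page"))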
blueskin_ almost 11 years ago
There are enough malicious bots that do follow robots.txt to make it still an important option for most sites.
Istof almost 11 years ago

500 KB limit? you call that short and sweet?