yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i could really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

the article is pretty much correct (although strangely worded in places), but the stuff about "communicating with google via robots.txt comments" is of course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in robots.txt. why?

you can group user agents, i.e.:

    User-agent: Googlebot
    User-agent: bingbot
    User-agent: Yandex
    Disallow: /
Congrats, you have just disallowed googlebot, bingbot and yandex from crawling (not indexing, just crawling).
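to make the grouping concrete, here is a toy parser (my own sketch, not the actual code of the npm package) that collapses consecutive User-agent lines into one group, the way i read the google spec:

    // toy parser: consecutive User-agent lines form one group that
    // shares the Disallow/Allow rules following them; a blank line
    // ends the current group
    function parseGroups(text) {
      const groups = [];
      let current = null;
      for (const raw of text.split('\n')) {
        const line = raw.trim();
        if (line === '') { current = null; continue; }
        const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
        if (!m) continue; // ignore anything we cannot parse
        const field = m[1].toLowerCase();
        if (field === 'user-agent') {
          // a user-agent line directly after another one joins its group
          if (!current || current.rules.length > 0) {
            current = { agents: [], rules: [] };
            groups.push(current);
          }
          current.agents.push(m[2].toLowerCase());
        } else if (current) {
          current.rules.push({ field, value: m[2] });
        }
      }
      return groups;
    }

    const text = 'User-agent: Googlebot\nUser-agent: bingbot\n' +
                 'User-agent: Yandex\nDisallow: /';
    console.log(parseGroups(text));
    // -> one group: all three agents share the single "Disallow: /" rule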
ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-agent: Yandex
    Disallow: /
so, well, you have definitely blocked yandex and you do not care about bingbot (commented out), but what about googlebot? are googlebot and yandex part of one user-agent group? or is googlebot its own group and yandex its own group? if the commented line is interpreted as a blank line, then googlebot and yandex are different groups; if it's interpreted as non-existent, they belong together.

the way i read the spec https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt, this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt. it has this undefined behaviour, it does not follow HTTP 301 redirects (which was unspecified when i coded it), and it tries to do too much (fetching and analysing; it should only do one thing).
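to see both readings side by side, here is a continuation of the toy parseGroups sketch from above. the only difference is whether the preprocessor turns a comment line into a blank line or drops it entirely (hypothetical code, not the behaviour of any particular crawler):

    const input = 'User-agent: Googlebot\n#User-agent: bingbot\n' +
                  'User-agent: Yandex\nDisallow: /';

    // reading 1: a comment line counts as a blank line, so it separates
    // groups; googlebot ends up in its own group with no rules at all
    const asBlank = input.split('\n')
      .map(line => (line.trim().startsWith('#') ? '' : line))
      .join('\n');

    // reading 2: a comment line is removed entirely, so googlebot and
    // yandex stay in one group and both are blocked
    const asRemoved = input.split('\n')
      .filter(line => !line.trim().startsWith('#'))
      .join('\n');

    console.log(parseGroups(asBlank).length);   // 2 groups
    console.log(parseGroups(asRemoved).length); // 1 group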
by the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:
    Sitemap: http://www.example.com/your-sitemap-index.xml
and return HTTP 200.

why? if you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some response that can be misleading. also, it's quite common for the staging robots.txt file to spill over into the real world; this happens as soon as you forget that you have to care about your real robots.txt.

also, read the spec: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
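for what it's worth, here is a minimal node sketch of that recommendation (express and port 3000 are my choices for the example, not part of any spec):

    const express = require('express'); // npm install express

    const app = express();

    // always answer /robots.txt explicitly with HTTP 200 and a known body,
    // so crawlers never see a surprise 500 or a spilled-over staging file
    app.get('/robots.txt', (req, res) => {
      res.status(200).type('text/plain').send(
        'User-agent: *\n' +
        'Disallow:\n' +
        'Sitemap: http://www.example.com/your-sitemap-index.xml\n'
      );
    });

    app.listen(3000);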