
Robots.txt Disallow: 20 Years of Mistakes To Avoid

106 points by hornokplease almost 11 years ago

11 comments

Asparagirl almost 11 years ago

This article forgot the *very* worst use of robots.txt:

    User-agent: ia_archiver
    Disallow: /

Those two lines mean that all content hosted on the entire site will be blocked from the Internet Archive (archive.org) Wayback Machine, and the public will be unable to look at any previous versions of the website's content. It wipes out a public view of the past.

Yeah, I'm looking at you, Washington Post: http://www.washingtonpost.com/robots.txt

Banning access to history like that is shameful.
TheLoneWolfling almost 11 years ago

What frustrates me is the number of websites that impose additional restrictions on anything they don't recognize, or worse, websites that impose additional restrictions on (or worse yet, just outright ban) anything that isn't Googlebot.

And people wonder why alternative search engines have such a hard time taking off.
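For a concrete picture, the pattern being criticized looks roughly like this (a hypothetical robots.txt, not taken from any particular site): Googlebot gets free rein while every other crawler is shut out.

    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /

Any crawler that is not Googlebot matches the * group and is blocked from the whole site, which is exactly the barrier to entry for alternative search engines described above.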
dredge almost 11 years ago

The article contains some good observations, but I'm struggling to understand this one:

"Some sites try to communicate with Google through comments in robots.txt"

In the examples given, none appear to be trying to "communicate with Google through comments" - how is including...

    # What's all this then?
    #    \
    #
    #  -----
    # | . . |
    #  -----
    # \--|-|--/
    #    |  |
    #  |-------|

...a "mistake" to avoid? There's no harm in it at all.
freddielarge almost 11 years ago

Fun fact: robots.txt can also be used by attackers to find admin interfaces or other sensitive tidbits that you don't want search engines to crawl.

Lots of target-detection crawlers will look at robots.txt as the very first thing they do, to see if there are any fun pages you don't want the other crawlers to see.
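A minimal sketch of what such a reconnaissance pass might look like (the host name and parsing details are illustrative, not taken from any real tool):

    # Fetch a site's robots.txt and list the paths it asks crawlers to avoid -
    # exactly the "interesting" paths an attacker would want to inspect first.
    import urllib.request

    def disallowed_paths(host):
        url = f"https://{host}/robots.txt"
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        paths = []
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path and path != "/":  # a bare "/" blocks everything; not a hint
                    paths.append(path)
        return paths

    print(disallowed_paths("example.com"))

The practical upshot: protect sensitive URLs with authentication, not with a Disallow line that doubles as a signpost.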
spaulo12 almost 11 years ago

In the past I've created an empty robots.txt just to keep the 404 errors out of my logs...
sp332 almost 11 years ago
Why does Google ignore the crawl delay?
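For context, Crawl-delay is a non-standard directive that some engines (Bing and Yandex, for example) have honored, but Google never supported it; Google instead exposed crawl-rate settings through its webmaster tools. A typical use looks like this (hypothetical example):

    User-agent: bingbot
    Crawl-delay: 10

Here the value is conventionally read as the number of seconds a crawler should wait between requests.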
pipihu almost 11 years ago

The main use for robots.txt is to prevent crawling of infinite URL spaces: http://googlewebmastercentral.blogspot.com.br/2008/08/to-infinity-and-beyond-no.html

Alongside tagging links to such resources with nofollow.
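As a hypothetical illustration of both techniques (the /calendar/ path is made up): an endless day-by-day calendar is a classic infinite URL space, so you keep crawlers out of it in robots.txt...

    User-agent: *
    Disallow: /calendar/

...and additionally mark the links that lead into it, e.g. <a href="/calendar/2014-07-01" rel="nofollow">next day</a>, so crawlers are never handed an unbounded trail of URLs in the first place.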
sbierwagen almost 11 years ago

My server returns 410 GONE to robots.txt requests.

The robots exclusion protocol is a ridiculous anachronism. I don't use it and neither should you.
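For the curious, one way to implement this (a sketch assuming an nginx front end; adapt for your own server):

    # Answer robots.txt requests with 410 Gone instead of serving a file.
    location = /robots.txt {
        return 410;
    }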
franze almost 11 years ago

yeah, robots.txt is a horrible standard. trust me, i wrote https://www.npmjs.org/package/robotstxt just so that i could really understand what is going on. it's based on https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

the article is pretty much correct (although strangely worded at times), but the stuff about "communicating via robots.txt comments to google" is of course not true. the examples he gives are developer jokes, nothing more.

still, you should not use comments in robots.txt. why?

you can group user agents, i.e.:

    User-agent: Googlebot
    User-agent: bingbot
    User-Agent: Yandex
    Disallow: /

congrats, you have just disallowed Googlebot, bingbot and Yandex from crawling (not indexing, just crawling).

ok, now:

    User-agent: Googlebot
    #User-agent: bingbot
    User-Agent: Yandex
    Disallow: /

you have definitely blocked Yandex, and you don't care about bingbot (commented out), but what about Googlebot? are Googlebot and Yandex part of the same user-agent group? or is Googlebot its own group and Yandex its own group? if the commented line is interpreted as a blank line, then Googlebot and Yandex are different groups; if it's interpreted as nonexistent, they belong together.

the way i read the spec at https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt, this behaviour is undefined. (please correct me if i'm wrong)

simple solution: don't use comments in the robots.txt file.

also, please somebody fork and take over https://www.npmjs.org/package/robotstxt: it has this undefined behaviour, it does not follow HTTP 301 redirects (which were unspecified when i coded it), and it tries to do too much (fetching and analysing; it should only do one thing).

by the way, my recommendation is to have a robots.txt file like this:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/your-sitemap-index.xml

and return HTTP 200.

why: if you do not have a file there, then at some point in the future you will suddenly return HTTP 500, or HTTP 200 with some response that can be misleading. also it's quite common for the staging robots.txt file to spill over into the real world, which happens as soon as you forget that you have to care about your real robots.txt.

also, read the spec: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
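The grouping ambiguity above can be made concrete with a toy parser. This is a hypothetical sketch, not how any real crawler works; it simply shows that the two plausible readings of a commented-out User-agent line yield different groups (blank lines are assumed to close a group, which is itself a simplification):

    # Toy robots.txt group parser demonstrating the comment-line ambiguity.
    def parse_groups(text, comments_as_blank):
        groups, agents, rules = [], [], []
        for raw in text.splitlines():
            line = raw.strip()
            if line.startswith("#"):
                if not comments_as_blank:
                    continue        # reading A: the comment never existed
                line = ""           # reading B: treat it as a blank line
            if not line:            # a blank line closes the current group
                if agents:
                    groups.append((agents, rules))
                    agents, rules = [], []
                continue
            field, _, value = line.partition(":")
            field = field.strip().lower()
            if field == "user-agent":
                if rules:           # rules already seen: a new group begins
                    groups.append((agents, rules))
                    agents, rules = [], []
                agents.append(value.strip())
            elif field == "disallow":
                rules.append(value.strip())
        if agents:
            groups.append((agents, rules))
        return groups

    sample = """User-agent: Googlebot
    #User-agent: bingbot
    User-Agent: Yandex
    Disallow: /"""

    print(parse_groups(sample, comments_as_blank=True))
    # [(['Googlebot'], []), (['Yandex'], ['/'])]  -> Googlebot unrestricted!
    print(parse_groups(sample, comments_as_blank=False))
    # [(['Googlebot', 'Yandex'], ['/'])]          -> both blocked

Same file, two defensible parses, opposite outcomes for Googlebot: exactly why the comment advises keeping comments out of robots.txt.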
blueskin_ almost 11 years ago
There are enough malicious bots that do follow robots.txt to make it still an important option for most sites.
Istof almost 11 years ago
500kb limit? you call that short and sweet?