TechEcho
Analyzing One Million Robots.txt Files

157 points by foob over 7 years ago

7 comments

zorpner over 7 years ago
"The next steps towards standardization began when Google, Yahoo, and Microsoft came together to define and support the sitemap protocol in 2006. Then in 2007, they announced that all three of them would support the Sitemap directive in robots.txt files. And yes, that important piece of internet history from the blog of a formerly 125 Billion dollar company now only exists because it was archived by Archive.org."

The Internet Archive (archive.org) is currently running their end-of-year donation drive, if you value the work they do it's a good time to donate: https://archive.org/donate/

(and on the topic of robots.txt, it sounds like they're moving in the direction of disallowing people from using them indiscriminately to block access to valuable archival materials: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ )
benfrederickson over 7 years ago
I also wrote up an analysis of the top 1M robots.txt files: http://www.benfrederickson.com/robots-txt-analysis/

I ended up analyzing very different things from this article though, so this article was still pretty interesting to me.
jp_sc over 7 years ago
“traditionally used for vague attempts at humor which signal to twenty-something white males that this is a “cool” place to work.”

WTF with the casual sexism/ageism?
feelin_googley over 7 years ago
"The web servers might not have cared about the traffic, but it turns out that you can only look up domains so quickly before a DNS server starts to question your intentions!"

s/DNS server/third party open resolver/

IME, querying an authoritative server for the desired name triggers no such limitations.

One does not even need to use DNS to get the IP addresses for those authoritative servers, if the zone file is made available for free to the public as most are, under the ICANN rules.

I have thought about building a database of robots.txt many times. IMO, robots.txt has an important role besides thwarting "bots". It can thwart humans as well. It can be used to make entire websites "disappear" from the Internet Archive Wayback Machine.

Perhaps others are making mirrors of the IA.

However, I have thought it could be useful to monitor the robots.txt of important websites on a more frequent basis than IA, in order to (if possible) preemptively archive the IA's collections if robots.txt changes are ever detected that would effectively "erase" them from the IA.

Perhaps the greatest thing about robots.txt is that it is "plain text". This "rule" seems to be ubiquitously honoured. Did the author ever find any html, css, javascript or other surprises in any robots.txt file?
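The monitoring idea above could be sketched roughly as follows. This is only an illustration, not something from the article or the comment: it assumes the archive's crawler is matched by the user agent `ia_archiver`, and the helper names `is_blocked` and `newly_blocked` are hypothetical. It compares two snapshots of a site's robots.txt using Python's standard `urllib.robotparser` and reports whether the newer one blocks a path that the older one allowed.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the archive crawler identifies itself as "ia_archiver".
ARCHIVE_AGENT = "ia_archiver"


def is_blocked(robots_txt: str, agent: str = ARCHIVE_AGENT, path: str = "/") -> bool:
    """Return True if this robots.txt text disallows `agent` from fetching `path`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


def newly_blocked(old_robots_txt: str, new_robots_txt: str,
                  agent: str = ARCHIVE_AGENT, path: str = "/") -> bool:
    """Return True if `path` was allowed in the old snapshot but is blocked in the new one."""
    was_blocked = is_blocked(old_robots_txt, agent, path)
    now_blocked = is_blocked(new_robots_txt, agent, path)
    return now_blocked and not was_blocked


if __name__ == "__main__":
    # Hypothetical snapshots of the same site's robots.txt at two points in time.
    old_snapshot = "User-agent: *\nDisallow: /private/\n"
    new_snapshot = "User-agent: ia_archiver\nDisallow: /\n"
    if newly_blocked(old_snapshot, new_snapshot):
        print("robots.txt change would hide the site from the archive crawler")
```

In a real monitor the two snapshots would come from periodic fetches of each site's /robots.txt; the comparison itself stays the same.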
mindB over 7 years ago
The history presented in this post was very interesting, but the analysis ended up being disappointing. The article ends just after they had managed to narrow their sample of robots.txt files to exclude duplicate and derivative files. They don't even present any summary statistics for this filtered sample.
tomcam over 7 years ago
Surprisingly interesting post that goes into the history of robots.txt and details how it is not, in fact, a W3C standard or a legal requirement.
CM30 over 7 years ago
Honestly, I'm kind of surprised that turnitin's bot listens to robots.txt, or that the 'anti copyright infringement' bots do the same. Seems like it provides a very simple way for a cheating site to just thwart their entire 'system'.

But hey, I guess it's one of those cases where the law and basic ethics clash a bit: since certain laws say 'unauthorised' access to a server is illegal, ignoring robots.txt would leave them under fire for that instead.