Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.<p>Having written the robots.txt parser at Blekko, I can tell you that what standards there are, are incomplete and inconsistent.<p>Robots.txt files are usually written by hand in random text editors ("\n" vs "\r\n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF from the RFC. There are situations where adding a newline completely negates all your rules: specifically, newlines between user-agent lines, or between user-agent lines and their rules.<p>My first inclination was to build an RFC-compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.<p>Then add sitemaps, crawl-delay, and the other non-standard syntax adopted by Google, Bing, and Yahoo (RIP). Clearly the RFC is just a starting point, and what ends up on websites can be so broken that the author's meaning is hard to interpret. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.<p>If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get out the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes I couldn't figure out what a particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.<p>(To the webmaster with the broken robots.txt and links on every product page with a CGI arg containing "&action=DELETE": we're so sorry! but... why???)<p>Here's the Perl for the Blekko robots.txt parser:
<a href="https://github.com/randomstring/ParseRobotsTXT" rel="nofollow">https://github.com/randomstring/ParseRobotsTXT</a>
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence[0][1]. Google's own documentation says that noindex belongs in an HTML meta tag, but the SEO people seem to trust these shady sites more.<p>I haven't read through all of the code, but assuming this is actually what's running on Google's crawlers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.<p>[0] <a href="https://www.deepcrawl.com/blog/best-practice/robots-txt-noindex-the-best-kept-secret-in-seo/" rel="nofollow">https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...</a><p>[1] <a href="https://www.stonetemple.com/does-google-respect-robots-txt-noindex-and-should-you-use-it/" rel="nofollow">https://www.stonetemple.com/does-google-respect-robots-txt-n...</a><p>[2] <a href="https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L262-L276" rel="nofollow">https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...</a>
The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one, and almost every modern website deviates from it.<p>For instance, it explicitly says: "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."<p>And behavior differs so much between parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.<p>Most search engines obey it as a matter of principle, but not all crawlers or archivers [1] do.<p>It's a good example of missing standards in the wild.<p>[0] <a href="https://www.robotstxt.org/robotstxt.html" rel="nofollow">https://www.robotstxt.org/robotstxt.html</a><p>[1] <a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/" rel="nofollow">https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...</a>
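To illustrate the "Allow" point with a made-up example: under the original spec, the only way to exclude everything except one file is to rearrange your site so the allowed file sits outside a disallowed directory, whereas with the widely supported (but non-standard) Allow extension you can just write:<p><pre><code> User-agent: *
 Allow: /public/landing.html
 Disallow: /
</code></pre>
A parser that only implements the original spec ignores the Allow line and blocks everything, which is exactly the kind of divergence between implementations that makes behavior unpredictable.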
I absolutely understand why they did this, but I have to say I was disappointed to see only 7 commits at <a href="https://github.com/google/robotstxt/commits/master" rel="nofollow">https://github.com/google/robotstxt/commits/master</a> dating back to June 25th.<p>When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".
> This library has been around for 20 years and it contains pieces of code that were written in the 90's.<p>Whilst I am sure there are good reasons for the omission, it would have been interesting to see the entirety of the commit history for this library.
Note that this is quite strict about what characters may appear in a bot's user agent. This is due to strictness in the REP draft standard.<p><a href="https://github.com/google/robotstxt/blob/master/robots_test.cc#L152" rel="nofollow">https://github.com/google/robotstxt/blob/master/robots_test....</a><p><pre><code> // A user-agent line is expected to contain only [a-zA-Z_-] characters and must
// not be empty. See REP I-D section "The user-agent line".
// https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
</code></pre>
So you may need to adjust your bot’s UA for proper matching.<p>(Disclosure, I work at Google, though not on anything related to this.)
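If you want to sanity-check how your UA will be matched, a quick sketch using the matcher class from robots.h (the bot name and URLs here are made up, and this is just an illustration rather than anything from the repo's docs) looks something like this:<p><pre><code> #include &lt;iostream&gt;
 #include &lt;string&gt;

 #include "robots.h"  // from google/robotstxt

 int main() {
   const std::string robots_txt =
       "user-agent: ExampleBot\n"
       "disallow: /private/\n";

   googlebot::RobotsMatcher matcher;
   // Pass only the product token ("ExampleBot"), not a full UA string like
   // "ExampleBot/1.2 (+https://example.com/bot)"; per the I-D, the
   // user-agent line in robots.txt contains only [a-zA-Z_-] characters.
   bool allowed = matcher.OneAgentAllowedByRobots(
       robots_txt, "ExampleBot", "https://example.com/private/page.html");
   std::cout &lt;&lt; (allowed ? "allowed" : "disallowed") &lt;&lt; "\n";
 }
</code></pre>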
I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".
> how should they deal with robots.txt files that are hundreds of megabytes large?<p>What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - <a href="https://github.com/robots.txt" rel="nofollow">https://github.com/robots.txt</a> - which is only about 30 kilobytes.
Fun & useless little bit of trivia: sci-fi author [1] Charles Stross (who hangs around here) is the reason the first robots.txt came into existence.<p><a href="http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html" rel="nofollow">http://www.antipope.org/charlie/blog-static/2009/06/how_i_go...</a><p>(reminds me how Y Combinator's co-founder Robert Morris has a bit of youthful notoriety from a less innocent program)<p>[1] and former code monkey from the dot-com era
I guess lots of people misspell ~disalow~ disallow[1]<p>1. <a href="https://github.com/google/robotstxt/blob/master/robots.cc#L691" rel="nofollow">https://github.com/google/robotstxt/blob/master/robots.cc#L6...</a>
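Apparently enough people that it's worth handling. A rough sketch of that kind of typo tolerance (the accepted spellings here are illustrative; robots.cc has the set Google actually allows):<p><pre><code> #include &lt;algorithm&gt;
 #include &lt;cctype&gt;
 #include &lt;string&gt;

 // Case-insensitive check for "disallow" and a few common misspellings.
 bool IsDisallowKey(std::string key) {
   std::transform(key.begin(), key.end(), key.begin(),
                  [](unsigned char c) { return static_cast&lt;char&gt;(std::tolower(c)); });
   static const char* const kSpellings[] = {
       "disallow", "dissallow", "dissalow", "disalow", "disallaw"};
   for (const char* spelling : kSpellings) {
     if (key == spelling) return true;
   }
   return false;
 }
</code></pre>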
I doubt there are any vulns in the code, seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.<p>But I'm sure someone out there will fuzz it...
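For anyone tempted, a fuzz target is only a few lines with libFuzzer. A minimal sketch, assuming the RobotsMatcher API from robots.h and a made-up bot name and URL:<p><pre><code> #include &lt;cstddef&gt;
 #include &lt;cstdint&gt;
 #include &lt;string&gt;

 #include "robots.h"  // from google/robotstxt

 // libFuzzer entry point: treat arbitrary bytes as a robots.txt body and
 // run them through the parser/matcher, looking for crashes or hangs.
 extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
   std::string robots_txt(reinterpret_cast&lt;const char*&gt;(data), size);
   googlebot::RobotsMatcher matcher;
   matcher.OneAgentAllowedByRobots(robots_txt, "FuzzBot",
                                   "https://example.com/index.html");
   return 0;
 }
</code></pre>
Build with something like clang++ -fsanitize=fuzzer,address, link against the library, and let it run.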
Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...
That's actually nice and straightforward and relatively simple. I had expected something over-engineered, with at least parts of the code dedicated to demonstrating how much smarter the author is than you. But it's not. Just a simple parser.
Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.
I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.<p>What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "all the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: Gmail, Drive, etc.)
So, is it premature to expect a Go package by Google as well?<p>There's already <a href="https://github.com/temoto/robotstxt" rel="nofollow">https://github.com/temoto/robotstxt</a>
Is Golang significantly slower than C++? I thought Google invented Golang precisely for this kind of code for their internal use.<p>I had thought most of the systems code inside Google would be Golang by now. Is that not the case?
The code doesn't look too big - I don't think porting is the big issue.
“Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.”<p>The amount of arrogance in this sentence is insane.<p>Because Google's way is the one true way?
Never before has a company stood on such a mountain of open source code, made so much money with it, and contributed <i>so</i> <i>little</i>.<p>No really. Microsoft? The BSD TCP/IP stack maybe saved Win95, but there was Trumpet Winsock, and they probably would have survived writing their own for the next release.<p>Google doesn't get off the ground and has literally no products and no services without the GPL code that they fork, provide remote access to a process running their fork, and contribute nothing back to. Good end run around the spirit of the GPL there, and it has made them a fortune (they have many fortunes; that's just one of them).<p>New projects from Google? They're only open source if Google really needs them to be, like Go, which would get nowhere if it weren't open and would be very expensive for Google if they had to train all their engineers in it rather than pushing that cost back on their employees.<p>At least they don't go in for software patents, right? Oh, wait...<p>At least they have a motto of "Don't be evil", which we pretty much all have personally, but it's great that a corporation backs it. Corporate restructurings happen, sure, oh wait, the motto is now gone. "Do the right thing"? Well, this is fine, and Google does it, for all values of right that equal "profitable to Google and career enhancing for senior execs".<p>But this is great: a robots.txt parser that's open source. Someone other than Google could do something useful for the web with it, like writing a validator, because Google won't. Seemingly because it's not their definition of "do the right thing."<p>"Better than Facebook, better than Facebook, any criticism of Google is by people who don't like Google, so it's invalid." Only with more words. Or none, just one button. Go.