Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.<p>Having written the robots.txt parser at Blekko, I can tell you that what standards there are, are incomplete and inconsistent.<p>Robots.txt files are usually written by hand in random text editors ("\n" vs "\r\n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF from the RFC. There are situations where adding a newline completely negates all your rules: specifically, newlines between user-agent lines, or between user-agent lines and their rules.<p>My first inclination was to build an RFC-compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.<p>Then add sitemaps, crawl-delay, and the other non-standard syntax adopted by Google, Bing, and Yahoo (RIP). Clearly the RFC is just a starting point, and what ends up on websites can be so broken that the author's meaning is hard to interpret. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.<p>If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get out the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes I couldn't figure out what a particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.<p>(To the webmaster with the broken robots.txt and links on every product page with a CGI arg containing "&action=DELETE": we're so sorry! but... why???)<p>Here's the Perl for the Blekko robots.txt parser:
<a href="https://github.com/randomstring/ParseRobotsTXT" rel="nofollow">https://github.com/randomstring/ParseRobotsTXT</a>
I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence[0][1]. Google's own documentation says that noindex belongs in an HTML meta tag, but the SEO people seem to trust these shady sites more.<p>I haven't read through all of the code, but assuming this is actually what's running on Google's crawlers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.<p>[0] <a href="https://www.deepcrawl.com/blog/best-practice/robots-txt-noindex-the-best-kept-secret-in-seo/" rel="nofollow">https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...</a><p>[1] <a href="https://www.stonetemple.com/does-google-respect-robots-txt-noindex-and-should-you-use-it/" rel="nofollow">https://www.stonetemple.com/does-google-respect-robots-txt-n...</a><p>[2] <a href="https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613326dd4dfc8c9b9a545e45/robots.cc#L262-L276" rel="nofollow">https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...</a>
The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one, and almost every modern website deviates from it.<p>For instance, it explicitly says: "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."<p>And behavior differs so much between parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.<p>Most search engines obey it as a matter of principle, but not all crawlers or archivers [1] do.<p>It's a good example of missing standards in the wild.<p>[0] <a href="https://www.robotstxt.org/robotstxt.html" rel="nofollow">https://www.robotstxt.org/robotstxt.html</a><p>[1] <a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/" rel="nofollow">https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...</a>
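To illustrate the "Allow" point with a made-up example: under the original spec, the only way to exclude everything except one file is to rearrange your site so the allowed file sits outside a disallowed directory, whereas with the widely supported (but non-standard) Allow extension you can just write:<p><pre><code> User-agent: *
 Allow: /public/landing.html
 Disallow: /
</code></pre>
A parser that only implements the original spec ignores the Allow line and blocks everything, which is exactly the kind of divergence between implementations that makes behavior unpredictable.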
I absolutely understand why they did this, but I have to say I was disappointed to see only 7 commits at <a href="https://github.com/google/robotstxt/commits/master" rel="nofollow">https://github.com/google/robotstxt/commits/master</a> dating back to June 25th.<p>When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".
> This library has been around for 20 years and it contains pieces of code that were written in the 90's.<p>Whilst I am sure there are good reasons for the omission, it would have been interesting to see the entirety of the commit history for this library.
Note that this is quite strict about what characters may appear in a bot's user agent. This is due to strictness in the REP draft standard.<p><a href="https://github.com/google/robotstxt/blob/master/robots_test.cc#L152" rel="nofollow">https://github.com/google/robotstxt/blob/master/robots_test....</a><p><pre><code> // A user-agent line is expected to contain only [a-zA-Z_-] characters and must
// not be empty. See REP I-D section "The user-agent line".
// https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1
</code></pre>
So you may need to adjust your bot’s UA for proper matching.<p>(Disclosure, I work at Google, though not on anything related to this.)
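If you want to sanity-check how your UA will be matched, a quick sketch using the matcher class from robots.h (the bot name and URLs here are made up, and this is just an illustration rather than anything from the repo's docs) looks something like this:<p><pre><code> #include &lt;iostream&gt;
 #include &lt;string&gt;

 #include "robots.h"  // from google/robotstxt

 int main() {
   const std::string robots_txt =
       "user-agent: ExampleBot\n"
       "disallow: /private/\n";

   googlebot::RobotsMatcher matcher;
   // Pass only the product token ("ExampleBot"), not a full UA string like
   // "ExampleBot/1.2 (+https://example.com/bot)"; per the I-D, the
   // user-agent line in robots.txt contains only [a-zA-Z_-] characters.
   bool allowed = matcher.OneAgentAllowedByRobots(
       robots_txt, "ExampleBot", "https://example.com/private/page.html");
   std::cout &lt;&lt; (allowed ? "allowed" : "disallowed") &lt;&lt; "\n";
 }
</code></pre>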
I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".
> how should they deal with robots.txt files that are hundreds of megabytes large?<p>What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - <a href="https://github.com/robots.txt" rel="nofollow">https://github.com/robots.txt</a> - which is only about 30 kilobytes.
Fun & useless little bit of trivia: sci-fi author [1] Charles Stross (who hangs around here) is the reason the first robots.txt came into existence.<p><a href="http://www.antipope.org/charlie/blog-static/2009/06/how_i_got_here_in_the_end_part_3.html" rel="nofollow">http://www.antipope.org/charlie/blog-static/2009/06/how_i_go...</a><p>(reminds me how Y Combinator's co-founder Robert Morris has a bit of youthful notoriety from a less innocent program)<p>[1] and former code monkey from the dot-com era
I guess lots of people misspell ~disalow~ disallow[1]<p>1. <a href="https://github.com/google/robotstxt/blob/master/robots.cc#L691" rel="nofollow">https://github.com/google/robotstxt/blob/master/robots.cc#L6...</a>
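Apparently enough people that it's worth handling. A rough sketch of that kind of typo tolerance (the accepted spellings here are illustrative; robots.cc has the set Google actually allows):<p><pre><code> #include &lt;algorithm&gt;
 #include &lt;cctype&gt;
 #include &lt;string&gt;

 // Case-insensitive check for "disallow" and a few common misspellings.
 bool IsDisallowKey(std::string key) {
   std::transform(key.begin(), key.end(), key.begin(),
                  [](unsigned char c) { return static_cast&lt;char&gt;(std::tolower(c)); });
   static const char* const kSpellings[] = {
       "disallow", "dissallow", "dissalow", "disalow", "disallaw"};
   for (const char* spelling : kSpellings) {
     if (key == spelling) return true;
   }
   return false;
 }
</code></pre>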
I doubt there are any vulns in the code, seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.<p>But I'm sure someone out there will fuzz it...
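For anyone tempted, a fuzz target is only a few lines with libFuzzer. A minimal sketch, assuming the RobotsMatcher API from robots.h and a made-up bot name and URL:<p><pre><code> #include &lt;cstddef&gt;
 #include &lt;cstdint&gt;
 #include &lt;string&gt;

 #include "robots.h"  // from google/robotstxt

 // libFuzzer entry point: treat arbitrary bytes as a robots.txt body and
 // run them through the parser/matcher, looking for crashes or hangs.
 extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
   std::string robots_txt(reinterpret_cast&lt;const char*&gt;(data), size);
   googlebot::RobotsMatcher matcher;
   matcher.OneAgentAllowedByRobots(robots_txt, "FuzzBot",
                                   "https://example.com/index.html");
   return 0;
 }
</code></pre>
Build with something like clang++ -fsanitize=fuzzer,address, link against the library, and let it run.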
Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...
That's actually nice and straightforward and relatively simple. I had expected something over-engineered, with at least parts of the code dedicated to demonstrating how much smarter the author is than you. But it's not. Just a simple parser.
Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.
I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.<p>What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "all the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: Gmail, Drive, etc.)
So, is it premature to expect a Go package by Google as well?<p>There's already <a href="https://github.com/temoto/robotstxt" rel="nofollow">https://github.com/temoto/robotstxt</a>
Is Golang significantly slower than C++? I thought Google invented Golang precisely for this kind of code for their internal use.<p>I had thought most of the systems code inside Google would be Golang by now. Is that not the case?
The code doesn't look too big - I don't think porting is the big issue.
“Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.”<p>The amount of arrogance in this sentence is insane.<p>Because Google's way is the one true way?
Never before has a company stood on such a mountain of open source code, made so much money with it, and contributed <i>so</i> <i>little</i>.<p>No really. Microsoft? The BSD TCP/IP stack maybe saved Win95, but there was Trumpet Winsock, and they probably would have survived writing their own for the next release.<p>Google doesn't get off the ground and has literally no products and no services without the GPL code that they fork, provide remote access to a process running their fork, and contribute nothing back to. Good end run around the spirit of the GPL there, and it has made them a fortune (they have many fortunes; that's just one of them).<p>New projects from Google? They're only open source if Google really needs them to be, like Go, which would get nowhere if it weren't open and would be very expensive for Google if they had to train all their engineers in it rather than pushing that cost back on their employees.<p>At least they don't go in for software patents, right? Oh, wait...<p>At least they have a motto of "Don't be evil", which we pretty much all have personally, but it's great that a corporation backs it. Corporate restructurings happen, sure, oh wait, the motto is now gone. "Do the right thing"? Well, this is fine, and Google does it, for all values of right that equal "profitable to Google and career enhancing for senior execs".<p>But this is great: a robots.txt parser that's open source. Someone other than Google could do something useful for the web with it, like writing a validator, because Google won't. Seemingly because it's not their definition of "do the right thing."<p>"Better than Facebook, better than Facebook, any criticism of Google is by people who don't like Google, so it's invalid." Only with more words. Or none, just one button. Go.