I have an ASCII art Easter egg like this in an SEO product I made. :)<p><a href="https://www.checkbot.io/robots.txt" rel="nofollow">https://www.checkbot.io/robots.txt</a><p>I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to <i>allow</i> the page to be crawled in the robots.txt file, and then add a noindex response header (X-Robots-Tag: noindex) or a noindex robots meta tag to the page. This way the crawler gets to see the noindex instruction; if crawling is blocked, the page is never fetched and the instruction is never seen. Robots.txt controls which pages can be crawled, not which pages can be indexed.
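To make the crawl-vs-index distinction concrete, here is a rough Python sketch using only the standard library. The function names, the status strings, and the example.com URL are made up for illustration, and this is not how Google actually evaluates pages; it only mirrors the logic of the tip above: a noindex signal only counts if robots.txt lets the crawler fetch the page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots" ...> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives.append(attr.get("content", "").lower())


def has_noindex(headers, html):
    """True if an X-Robots-Tag header or a robots meta tag says noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    parser = _RobotsMetaParser()
    parser.feed(html)
    return any("noindex" in d for d in parser.directives)


def deindex_status(robots_txt, url, headers, html, agent="Googlebot"):
    """Hypothetical helper: combine the robots.txt check with the noindex check."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(agent, url)
    noindex = has_noindex(headers, html)
    if noindex and crawlable:
        return "will be deindexed"
    if noindex and not crawlable:
        # The crawler is not allowed to fetch the page, so it cannot read noindex.
        return "noindex is never seen"
    return "no noindex signal"


if __name__ == "__main__":
    blocked = "User-agent: *\nDisallow: /private/"
    url = "https://example.com/private/page"  # placeholder URL
    print(deindex_status(blocked, url, {"X-Robots-Tag": "noindex"}, ""))
```

With the Disallow rule in place, the example reports that the noindex header is never seen, which is the counterintuitive case described above.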
That’s a funny one!<p>Does anyone know of others like that?<p>Here is mine: <a href="https://FreeSolitaire.win/robots.txt" rel="nofollow">https://FreeSolitaire.win/robots.txt</a>
This is what happens if your robot isn't nice<p><pre><code> > curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
HTTP/2 403</code></pre>
One nice thing about CF's robots.txt is its inclusion of a sitemap:<p><a href="https://www.cloudflare.com/sitemap.xml" rel="nofollow">https://www.cloudflare.com/sitemap.xml</a><p>which contains links to educational materials like<p><a href="https://www.cloudflare.com/learning/ddos/layer-3-ddos-attacks/" rel="nofollow">https://www.cloudflare.com/learning/ddos/layer-3-ddos-attack...</a><p>It's potentially interesting to see their flattened IA (information architecture).
What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?<p>But why are it and Twitter the only whitelisted entries? It's a bit surprising that Google and Bing are missing, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?
Easy guess: that length breaks some legacy stuff.<p>But every robots.txt should have an auto-ban trap line,<p>i.e. crawl it and die.<p>Basically a script that puts the requesting IP into the firewall.<p>Of course it's possible to abuse that, so it has to be monitored.
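A rough sketch of such a trap in Python, under some loud assumptions: the /trap/ path, the TrapBanList class, and the in-memory ban set are all made up for illustration, and a real setup would call out to an actual firewall instead of keeping a set. The idea is that robots.txt disallows a path no polite crawler should touch, and any IP that requests it anyway gets banned.

```python
TRAP_PATH = "/trap/"  # hypothetical path; nothing on the site should link to it

# robots.txt tells well-behaved crawlers to stay away from the trap.
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"


class TrapBanList:
    """In-memory stand-in for a firewall ban list (an assumption, not a real firewall)."""

    def __init__(self, allowlist=()):
        self.allowlist = set(allowlist)  # IPs that must never be banned
        self.banned = set()
        self.log = []  # every trap hit is kept so bans can be reviewed for abuse

    def handle(self, ip, path):
        """Return the HTTP status a request from `ip` for `path` would get."""
        if ip in self.banned:
            return 403
        if path.startswith(TRAP_PATH):
            self.log.append((ip, path))
            if ip not in self.allowlist:
                self.banned.add(ip)
                return 403
        return 200


if __name__ == "__main__":
    bans = TrapBanList(allowlist={"10.0.0.1"})
    print(bans.handle("203.0.113.7", "/index.html"))
    print(bans.handle("203.0.113.7", "/trap/secret"))
    print(bans.handle("203.0.113.7", "/index.html"))
```

The log and the allowlist are there for the monitoring concern: someone could trick innocent clients into requesting the trap URL, so bans should be reviewable and reversible.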