TechEcho

10 comments

seanwilson 6 months ago

I have an ASCII art Easter egg like this in an SEO product I made. :) https://www.checkbot.io/robots.txt

I should probably add this SEO tip too, because the purpose of robots.txt is confusing: if you want to remove (deindex) a page from Google Search, you counterintuitively need to allow the page to be crawled in robots.txt, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt controls which pages can be crawled, not which pages can be indexed.
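To make the crawl-versus-index point concrete, here is a minimal Python sketch using only the standard-library `urllib.robotparser`. The helper name `noindex_will_be_seen`, the example.com URL, and both robots.txt bodies are made up for illustration, and the sketch only checks the `X-Robots-Tag` response header, not a noindex meta tag in the HTML.

```python
from urllib.robotparser import RobotFileParser


def noindex_will_be_seen(robots_txt, url, response_headers, user_agent="Googlebot"):
    """Return True if a crawler may fetch `url` AND the response carries noindex.

    Hypothetical helper for illustration: it only looks at the X-Robots-Tag
    header and ignores <meta name="robots"> tags in the page body.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        # Blocked from crawling, so the crawler never sees the noindex header.
        return False
    headers = {name.lower(): value.lower() for name, value in response_headers.items()}
    return "noindex" in headers.get("x-robots-tag", "")


# Made-up example inputs.
page = "https://example.com/old-page"
headers = {"X-Robots-Tag": "noindex"}
blocked = "User-agent: *\nDisallow: /old-page"
allowed = "User-agent: *\nAllow: /"

print(noindex_will_be_seen(blocked, page, headers))
print(noindex_will_be_seen(allowed, page, headers))
```

With the page disallowed in robots.txt, the helper reports that the noindex header will never be read; with crawling allowed, it reports that the header will be read.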

palsecam 6 months ago

That’s a funny one! Anyone know of others like that? Here is mine: https://FreeSolitaire.win/robots.txt

jsheard 6 months ago

This is what happens if your robot isn't nice:

    > curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
    HTTP/2 403

m-app 6 months ago

What does “OUR TREE IS A REDWOOD” refer to? A quick search doesn’t yield any definite results.

chrisweekly 6 months ago

One nice thing about CF's robots.txt is its inclusion of a sitemap: https://www.cloudflare.com/sitemap.xml which contains links to educational materials like https://www.cloudflare.com/learning/ddos/layer-3-ddos-attacks/

Potentially interesting to see their flattened IA...
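For anyone who wants to pull the sitemap reference out of a robots.txt programmatically, the following is a small sketch using the standard-library `urllib.robotparser` (its `site_maps()` method needs Python 3.8 or later). The helper name `sitemap_urls` and the sample file body are assumptions for illustration; only the sitemap URL itself comes from the comment above.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines were found.
    return parser.site_maps() or []


# Illustrative robots.txt body, not a copy of Cloudflare's actual file.
sample = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.cloudflare.com/sitemap.xml
"""

print(sitemap_urls(sample))
```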

yapyap 6 months ago

That’s cool, if any scrapers still respect robots.txt, that is.

CodesInChaos 6 months ago

What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?

But why are it and Twitter the only whitelisted entries? Google and Bing missing is a bit surprising, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?
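As a rough illustration of how per-agent groups in a robots.txt are evaluated, here is a sketch with the standard-library `urllib.robotparser`. The rules below are hypothetical and are not Cloudflare's actual file; the helper name `allowed_agents` and the example.com URL are also made up. It shows only what the file asks of each agent, not how a server treats requests in practice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not Cloudflare's actual file.
ROBOTS_TXT = """\
User-agent: Twitterbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Map each user agent name to whether robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, "https://example.com/page", ["Twitterbot", "Googlebot"])
print(result)
```

Under these made-up rules, the named agent matches its own group and is allowed, while any agent without a group of its own falls back to the `*` group and is disallowed.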

op00to 6 months ago

If those robots could read, they'd be very upset.

ck2 6 months ago

easy guess that length breaks some legacy stuff

but every robots.txt should have an auto-ban trap line

i.e. crawl it and die

basically a script that puts the requesting IP into the firewall

of course it's possible to abuse that, so it has to be monitored
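A minimal sketch of the trap idea, under stated assumptions: the trap path, the class name `TrapBanList`, and the IP addresses (taken from documentation ranges) are all hypothetical, and an in-memory set stands in for a real firewall. The allowlist and the log are there because, as noted above, a trap like this can be abused and needs monitoring.

```python
from dataclasses import dataclass, field

# Hypothetical trap path; it would be listed under Disallow in robots.txt so
# that well-behaved crawlers never request it.
TRAP_PATH = "/trap-do-not-crawl/"


@dataclass
class TrapBanList:
    """Collects IPs that requested the trap path, with an allowlist as a safety valve."""

    allowlist: set = field(default_factory=set)
    banned: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def handle_request(self, ip, path):
        """Return True if the request should be served, False if the IP is banned."""
        if path.startswith(TRAP_PATH) and ip not in self.allowlist:
            self.banned.add(ip)
            # Keep a record so a human can review bans and spot abuse.
            self.log.append((ip, path))
        return ip not in self.banned


bans = TrapBanList(allowlist={"203.0.113.10"})
print(bans.handle_request("198.51.100.7", "/trap-do-not-crawl/"))
print(bans.handle_request("198.51.100.7", "/index.html"))
print(bans.handle_request("203.0.113.10", "/trap-do-not-crawl/"))
```

In this sketch the first client is refused from the moment it touches the trap, including on later normal pages, while the allowlisted address is still served.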

orliesaurus 6 months ago

Has anyone worked on anything like this for AI scrapers?

Cloudflare.com's Robots.txt
