Cloudflare.com's Robots.txt

145 points by sans_souse 6 months ago

10 comments

seanwilson 6 months ago
I have an ASCII art Easter egg like this in an SEO product I made. :)

https://www.checkbot.io/robots.txt

I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to *allow* the page to be crawled in the robots.txt file, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt controls which pages can be crawled, not which pages can be indexed.
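To illustrate the tip above, here is a minimal Python sketch that checks both conditions: the page must be crawlable under robots.txt, and the response must carry a noindex instruction. The function name `crawler_can_see_noindex`, the example URLs, and the plain-dict headers are assumptions made for illustration; only the standard library's `urllib.robotparser` is used, and a real crawler may interpret rules differently.

```python
from urllib.robotparser import RobotFileParser


def crawler_can_see_noindex(robots_txt, url, user_agent, response_headers):
    """Return True only if the page is crawlable AND carries a noindex header.

    Hypothetical helper: a page blocked in robots.txt is never fetched,
    so a noindex instruction on it would never be seen.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Step 1: robots.txt decides whether the page may be crawled at all.
    if not parser.can_fetch(user_agent, url):
        return False

    # Step 2: the noindex instruction lives on the page response itself.
    # (A noindex meta tag in the HTML would serve the same purpose;
    # this sketch only inspects the X-Robots-Tag response header.)
    header_value = response_headers.get("X-Robots-Tag", "")
    return "noindex" in header_value.lower()


blocked_robots = "User-agent: *\nDisallow: /old-page\n"
open_robots = "User-agent: *\nAllow: /\n"
noindex_headers = {"X-Robots-Tag": "noindex"}
page = "https://example.com/old-page"

# Blocked from crawling: the noindex header is never seen.
print(crawler_can_see_noindex(blocked_robots, page, "Googlebot", noindex_headers))
# Crawlable and marked noindex: the instruction can take effect.
print(crawler_can_see_noindex(open_robots, page, "Googlebot", noindex_headers))
```

The two calls differ only in the robots.txt rules, which is the point of the tip: blocking the crawl hides the noindex instruction.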
palsecam 6 months ago
That’s a funny one!

Does anyone know of others like that?

Here is mine: https://FreeSolitaire.win/robots.txt
jsheard 6 months ago
This is what happens if your robot isn't nice:

    > curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
    HTTP/2 403
m-app 6 months ago
What does “OUR TREE IS A REDWOOD” refer to? A quick search doesn’t yield any definite results.
chrisweekly 6 months ago
One nice thing about CF's robots.txt is its inclusion of a sitemap:

https://www.cloudflare.com/sitemap.xml

which contains links to educational materials like

https://www.cloudflare.com/learning/ddos/layer-3-ddos-attacks/

Potentially interesting to see their flattened IA....
yapyap 6 months ago
That’s cool, if any scrapers still respect robots.txt, that is.
CodesInChaos 6 months ago
What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?

But why are it and Twitter the only whitelisted entries? Google and Bing being missing is a bit surprising, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?
op00to 6 months ago
If those robots could read, they'd be very upset.
ck2 6 months ago
Easy guess: that length breaks some legacy stuff.

But every robots.txt should have an auto-ban trap line, i.e. crawl it and die: basically a script that puts the requesting IP into the firewall.

Of course it's possible to abuse that, so it has to be monitored.
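The trap-line idea above could be sketched roughly as follows in Python. Everything here is an assumption for illustration: the `/trap/` path, the `handle_request` function, and the in-memory `banned` set, which merely stands in for a real firewall rule. The `allowlist` parameter is one hypothetical way to limit the abuse the comment warns about.

```python
# Hypothetical trap path, advertised as off-limits in robots.txt.
TRAP_PATH = "/trap/"

# A well-behaved crawler reads this and never requests the trap path.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


def handle_request(path, ip, banned, allowlist=frozenset()):
    """Return an HTTP status code, banning any IP that enters the trap.

    `banned` is a mutable set standing in for firewall state; a real
    deployment would call out to the firewall and log the event so the
    bans can be monitored and reversed.
    """
    if ip in banned:
        return 403

    # Only a client that ignored robots.txt should ever land here.
    if path.startswith(TRAP_PATH) and ip not in allowlist:
        banned.add(ip)
        return 403

    return 200


banned = set()
print(handle_request("/index.html", "203.0.113.7", banned))  # normal visit
print(handle_request("/trap/page", "203.0.113.7", banned))   # falls into the trap
print(handle_request("/index.html", "203.0.113.7", banned))  # now banned everywhere
```

Because anyone can trick a third party into requesting the trap URL (for example via an embedded link), the ban list needs review, which is why the sketch keeps it as inspectable state.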
orliesaurus 6 months ago
Has anyone worked on anything like this for AI scrapers?