
Ask HN: Why Are We Fighting AI Scrapers If the Internet Is Meant to Be Public?

3 points by iamarsibragimov 2 months ago
Cloudflare launched “AI Labyrinth” — traps for AI scrapers: fake AI-generated pages that confuse bots and burn their resources.

But… what’s the point? Can someone explain this to me? Why are we trying to make it harder for AIs to access public info humans can find anyway? We’ll get there either way.

8 comments

armchairhacker 2 months ago
It’s (usually) not about restricting access. Many AI scrapers are poorly implemented: they overload servers with requests, effectively DoS-ing the server and preventing anyone (including normal users and even the scrapers themselves) from accessing it.

Many of the AI scrapers’ requests also point to non-existent, redundant, or low-quality destinations. Websites provide a file, /robots.txt, that clearly indicates which URLs crawlers should and should not visit; but the AI scrapers ignore robots.txt, visiting any URL they find, and some they invent (which, naturally, turn out to be non-existent). Websites also indicate when the content at a specific URL has changed or may change; but AI scrapers ignore those indicators too, requesting the same URL for a static webpage sometimes seconds apart.

https://blog.cloudflare.com/ai-labyrinth/ specifies that it works against “unauthorized crawling” and “inappropriate bot activity”. I assume that a scraper (even an AI scraper) that respects robots.txt and doesn’t send requests at unreasonable rates won’t encounter the AI Labyrinth.
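For illustration, a minimal sketch (not from the thread; the URL, user-agent string, and timestamp are placeholders) of the two behaviors described above that careless scrapers skip: consulting robots.txt before fetching, and sending a conditional request so an unchanged page is not re-downloaded.

    # Sketch of a "polite" crawler using only the Python standard library.
    import urllib.error
    import urllib.request
    import urllib.robotparser

    AGENT = "ExampleCrawler/1.0"   # placeholder user-agent

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                      # fetch and parse robots.txt once per site

    url = "https://example.com/some/page"
    if rp.can_fetch(AGENT, url):
        req = urllib.request.Request(url, headers={
            "User-Agent": AGENT,
            # Timestamp from the previous crawl of this URL (placeholder value);
            # the server replies 304 Not Modified if the page hasn't changed.
            "If-Modified-Since": "Sat, 01 Mar 2025 00:00:00 GMT",
        })
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()  # page changed, process the new copy
        except urllib.error.HTTPError as err:
            if err.code != 304:
                raise               # 304 just means the cached copy is still valid
    else:
        pass                        # robots.txt disallows this URL: skip it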
Comment #43439777 not loaded
dave4420 2 months ago
Because AI scrapers are indistinguishable from a DDoS attack. See e.g. https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
mtmail 2 months ago
Website owners create a robots.txt file to guide crawlers/scrapers. Some URLs might be slow to generate (lots of data/database access), some URLs are irrelevant to any search, some URLs might be outdated or duplicate. Or under copyright or other usage restrictions.

Now we see the crawlers ignore the robots.txt.

Some crawlers don’t do 1 request per second but hit a website with 100 per second. And for days. And they crawl the same data again and again. It makes websites slow and has no immediate benefit to the humans.
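As a made-up illustration (the paths, bot name, and delay value are hypothetical), a robots.txt along these lines tells crawlers which URLs to avoid and how fast to go; the complaint above is that many AI scrapers simply never honor it:

    User-agent: *
    Disallow: /search    # slow, database-heavy queries
    Disallow: /print/    # duplicate, printer-friendly copies of pages
    Crawl-delay: 10      # at most one request every 10 seconds (honored by some crawlers, not all)

    User-agent: ExampleAIBot
    Disallow: /          # opt this crawler out entirely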
throwawayffffas 2 months ago
While I agree with you on sentiment, the bandwidth and compute that power these websites cost money, and most people who run websites cannot afford to provision the capacity required to serve these requests.

Additionally, the objective of these pages is to serve ads to real people or to funnel real people to paid products; AI traffic is, for them, a cost at best and a denial-of-service attack at worst.
999900000999 2 months ago
Let’s say I noticed a lot of people in my neighborhood don’t have fresh apples.

I own an apple farm, so I don’t mind leaving out a few apples boxed up and ready to go. For the sake of argument, just assume that these apples are fine, but by the time they could be transported for sale they wouldn’t be fresh.

For the first two years of doing this, most people would come and pick up a couple of apples and then go home. In the last two months, Jimbo pulls up a truck, dumps as many apples as possible into the back, and drives off.

Eventually I’m going to have to tell Jimbo to stop doing this, or at least charge a fee for each apple. Otherwise Jim is the only one who gets any.
Comment #43441085 not loaded
JohnFen 2 months ago
There are a number of different reasons. Some object to the effective DDoS attack many of these bots are engaging in. Others (I’m in this camp) don’t want their data used to train these models. Some are just resisting the bad behavior of AI companies that ignore robots.txt, etc.
scblock 2 months ago
This discussion and post from yesterday have some good perspective: https://news.ycombinator.com/item?id=43422413
mattl 2 months ago
I don’t want automated tools hitting the website and using a ton of bandwidth. A human being is not going to be able to click around all the pages on the site that quickly.

Also, the data on the site may be under terms that prohibit its use for AI slop.