
Ask HN: Why Are We Fighting AI Scrapers If the Internet Is Meant to Be Public?

3 points by iamarsibragimov 2 months ago
Cloudflare launched “AI Labyrinth” — traps for AI scrapers: fake AI-generated pages that confuse bots and burn their resources.

But… what’s the point? Can someone explain this to me? Why are we trying to make it harder for AIs to access public info humans can find anyway? We’ll get there either way.

8 comments

armchairhacker 2 months ago
It’s (usually) not about restricting access. Many AI scrapers are poorly implemented: they overload servers with requests, effectively DoS-ing the server and preventing anyone (including normal users and even the scrapers themselves) from accessing it.

Many of the AI scrapers’ requests also point to non-existent, redundant, or low-quality destinations. Websites provide a file, /robots.txt, that clearly indicates which URLs crawlers should and should not visit; but the AI scrapers ignore robots.txt, visiting any URL they find, and some they invent (which, naturally, turn out to be non-existent). Websites also indicate when the content at a specific URL has changed or may change; but AI scrapers ignore those indicators too, requesting the same URL for a static webpage sometimes seconds apart.

https://blog.cloudflare.com/ai-labyrinth/ specifies that it works against “unauthorized crawling” and “inappropriate bot activity”. I assume that a scraper (even an AI scraper) that respects robots.txt and doesn’t send requests at unreasonable rates won’t encounter the AI Labyrinth.
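For illustration, a minimal sketch (not from the thread; the URL, user-agent string, and timestamp are placeholders) of the two behaviors described above that careless scrapers skip: consulting robots.txt before fetching, and sending a conditional request so an unchanged page is not re-downloaded.

    # Sketch of a "polite" crawler using only the Python standard library.
    import urllib.error
    import urllib.request
    import urllib.robotparser

    AGENT = "ExampleCrawler/1.0"   # placeholder user-agent

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()                      # fetch and parse robots.txt once per site

    url = "https://example.com/some/page"
    if rp.can_fetch(AGENT, url):
        req = urllib.request.Request(url, headers={
            "User-Agent": AGENT,
            # Timestamp from the previous crawl of this URL (placeholder value);
            # the server replies 304 Not Modified if the page hasn't changed.
            "If-Modified-Since": "Sat, 01 Mar 2025 00:00:00 GMT",
        })
        try:
            with urllib.request.urlopen(req) as resp:
                body = resp.read()  # page changed, process the new copy
        except urllib.error.HTTPError as err:
            if err.code != 304:
                raise               # 304 just means the cached copy is still valid
    else:
        pass                        # robots.txt disallows this URL: skip it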
Comment #43439777 not loaded
dave4420 2 months ago
Because AI scrapers are indistinguishable from a DDoS attack. See e.g. https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/
mtmail 2 months ago
Website owners create a robots.txt file to guide crawlers/scrapers. Some URLs might be slow to generate (lots of data/database access), some URLs are irrelevant to any search, some URLs might be outdated or duplicate. Or under copyright or other usage restrictions.

Now we see the crawlers ignore the robots.txt.

Some crawlers don’t do 1 request per second but hit a website with 100 per second. And for days. And they crawl the same data again and again. It makes websites slow and has no immediate benefit to the humans.
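As a made-up illustration (the paths, bot name, and delay value are hypothetical), a robots.txt along these lines tells crawlers which URLs to avoid and how fast to go; the complaint above is that many AI scrapers simply never honor it:

    User-agent: *
    Disallow: /search    # slow, database-heavy queries
    Disallow: /print/    # duplicate, printer-friendly copies of pages
    Crawl-delay: 10      # at most one request every 10 seconds (honored by some crawlers, not all)

    User-agent: ExampleAIBot
    Disallow: /          # opt this crawler out entirely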
throwawayffffas 2 months ago
While I agree with you on sentiment, the bandwidth and compute that power these websites cost money, and most people who run websites cannot afford to provision the capacity required to serve these requests.

Additionally, the objective of these pages is to serve ads to real people or to funnel real people to paid products; AI traffic is, for them, a cost at best and a denial-of-service attack at worst.
999900000999 2 months ago
Let’s say I noticed a lot of people in my neighborhood don’t have fresh apples.

I own an apple farm, so I don’t mind leaving out a few apples boxed up and ready to go. For the sake of argument, just assume that these apples are fine, but by the time they could be transported for sale they wouldn’t be fresh.

For the first two years of doing this, most people would come and pick up a couple of apples and then go home. In the last two months, Jimbo pulls up a truck, dumps as many apples as possible into the back, and drives off.

Eventually I’m going to have to tell Jimbo to stop doing this, or at least charge a fee for each apple. Otherwise Jim is the only one who gets any.
Comment #43441085 not loaded
JohnFen 2 months ago
There are a number of different reasons. Some object to the effective DDoS attack many of these bots are engaging in. Others (I’m in this camp) don’t want their data used to train these models. Some are just resisting the bad behavior of AI companies that ignore robots.txt, etc.
scblock 2 months ago
This discussion and post from yesterday have some good perspective: https://news.ycombinator.com/item?id=43422413
mattl 2 months ago
I don’t want automated tools hitting the website and using a ton of bandwidth. A human being is not going to be able to click around all the pages on the site that quickly.

Also, the data on the site may be under terms that prohibit its use for AI slop.