Sites scramble to block ChatGPT web crawler after instructions emerge

71 points by specto, almost 2 years ago

9 comments

Meekro, almost 2 years ago
I have two sites that provide documentation for open source libraries I've created, and I definitely won't be blocking ChatGPT. It has already read my documentation and can correctly answer most StackOverflow-level questions about my libraries' use. This is seriously impressive and very helpful, as far as I'm concerned.
vouaobrasil, almost 2 years ago
From the article:

> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.

I would rather leave the internet entirely if AI chatbots become a primary user interface.
Comment #37096139 (not loaded)
Comment #37099021 (not loaded)
8organicbits, almost 2 years ago
I wonder if it's worth poisoning the replies for scrapers that don't obey robots.txt. Send back nonsense, lies, and noise. This would be an adversarial approach like https://adnauseam.io/ uses for ad tracking.
Comment #37095983 (not loaded)
Comment #37098817 (not loaded)
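A minimal sketch of the adversarial approach 8organicbits describes, written as an nginx rule that reroutes a non-compliant crawler to a pre-generated page of noise. The bot name, hostname, and file paths below are illustrative assumptions, not anything from the thread:

```nginx
# Sketch: crawlers that ignore robots.txt get rerouted to generated gibberish.
# "SomeNonCompliantBot", example.com, and all paths are placeholder assumptions.
server {
    listen 80;
    server_name example.com;
    root /var/www/html;

    location / {
        # "rewrite ... last" inside if-in-location is one of nginx's safe patterns
        if ($http_user_agent ~* "SomeNonCompliantBot") {
            rewrite ^ /noise.html last;
        }
        try_files $uri $uri/ =404;
    }

    location = /noise.html {
        root /var/www/poison;   # directory of pre-generated nonsense pages
    }
}
```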
JohnFen, almost 2 years ago
> blocking GPTBot will not guarantee that a site's data does not end up training all AI models of the future. Aside from issues of scrapers ignoring robots.txt files, there are other large data sets of scraped websites (such as The Pile) that are not affiliated with OpenAI.

This is why I'm not reassured. robots.txt isn't sufficient to stop all webcrawlers, so there's every reason to think it isn't sufficient to stop AI scrapers.

I still want to find a good solution to this problem so that I can open my sites up to the public again.
Comment #37095405 (not loaded)
Comment #37095066 (not loaded)
Comment #37097749 (not loaded)
wildpeaks, almost 2 years ago
This gives the illusion of being in control, but if enough people block the bot, they'll just scrape differently (if they don't already). Too much money is at stake, more than whatever fine they might face if they get caught and can't settle out of court, and by then they may consider it someone else's problem anyway.

It's more pragmatic to expect that any data that can be accessed one way or another will be scraped, because the interests of content authors and scrapers aren't aligned.

On the other hand, robots.txt benefited both search engines and content authors because it signaled data that wasn't useful to show in search results, so search engines had an incentive to follow its rules.
Comment #37099614 (not loaded)
blibble, almost 2 years ago
Blocked it on every single site I manage.

There is zero benefit to me in allowing OpenAI to absorb my content.

It is a parasite, plain and simple (as is GitHub Copilot).

And I'll be hooking in the procedurally generated garbage pages for it soon!
Comment #37095923 (not loaded)
Comment #37095319 (not loaded)
karaterobot, almost 2 years ago
The article does not say whether it obeys `User-agent: *`. My guess is that, if it doesn't respect that, it doesn't truly respect `User-agent: GPTBot` either.
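For context, OpenAI's announcement describes blocking GPTBot with a standard robots.txt directive; a minimal example, with a generic wildcard rule added purely for comparison with karaterobot's point (the `/private/` path is an assumption about a site's existing policy), would look like this:

```
# Block OpenAI's GPTBot entirely (directive documented by OpenAI)
User-agent: GPTBot
Disallow: /

# Generic wildcard rule, shown only for comparison: a crawler that
# ignores this probably ignores the GPTBot rule above as well
User-agent: *
Disallow: /private/
```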
askvictor, almost 2 years ago
I&#x27;ve been reading lots of datasheets and application notes in the embedded space recently. Most of these are only accessible after creating a (free) login. In one sense, it&#x27;s a reasonably simple way to prevent scraping like this (at least until the AI-based scrapers can generate their own logins). On the other hand, a lot of that kind of material would be _really_ useful to be able to ask an LLM about.
CableNinja, almost 2 years ago
For anyone reading this, you can skip the robots.txt; as others have pointed out, who knows if they will actually listen to it.

Instead, use a redirect or return a response code by doing a user-agent check in your server config. I posted elsewhere in this thread on the way I did it with nginx.
Comment #37098587 (not loaded)
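CableNinja's actual nginx snippet isn't loaded in this thread; a minimal sketch of that kind of user-agent check (server name and document root are illustrative assumptions) might be:

```nginx
# Sketch: refuse requests whose User-Agent contains "GPTBot" at the server
# level, independent of robots.txt. Hostname and paths are placeholders.
server {
    listen 80;
    server_name example.com;

    # Case-insensitive match on the User-Agent header; the 403 could equally
    # be a redirect, per CableNinja's suggestion.
    if ($http_user_agent ~* "GPTBot") {
        return 403;
    }

    location / {
        root /var/www/html;
        index index.html;
    }
}
```

After reloading nginx, a request such as `curl -A "GPTBot" http://example.com/` should return 403 while ordinary visitors are unaffected.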