TechEcho

Ask HN: Does ChatGPT respect Robots.txt?

36 points by iKevinShah about 2 years ago

Hello HN,

I was looking for more info on whether ChatGPT (and similar LLMs) respect robots.txt directives.

I couldn't find any details about the crawlers, or the rules they followed, from when ChatGPT was undergoing its training.

When I asked ChatGPT, it says "As an AI language model, I do not have the ability to crawl the web on my own. However, as a general principle, web crawlers should follow the rules specified in the website's" and when asked further about who created the dataset and the method they used, the response is "The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data."

So, no clear answer as far as I can tell. It is obviously a near-impossible task to build the dataset manually (given the size of the team), so some crawlers would have been used (assumption). If anyone knows or can shed light on it, that would be great.

8 comments

wildpeaks about 2 years ago

Let's not forget that ChatGPT can lie: just because it says something doesn't make it true.

It's more realistic to assume that any data a company is able to access will get gobbled up sooner or later, because there is no real penalty for ignoring robots.txt or licenses at their scale: even if someone were to notice an infraction and had enough money to sue them for years, they can afford it and brush it off as the cost of doing business (and if it's not ChatGPT, then another model; the cat's out of the bag now).

A robots.txt gives as much protection against a crawler as a "please do not hack me" text file gives against ransomware.
tlb about 2 years ago

It's trained on data crawled from the web, using CommonCrawl among other sources. See https://github.com/openai/following-instructions-human-feedback/blob/main/model-card.md. CommonCrawl certainly respects robots.txt, and I assume the other sources do too.
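To make "respects robots.txt" concrete, here is a minimal sketch of the check a well-behaved crawler performs before fetching a URL, using Python's standard `urllib.robotparser`. The robots.txt contents and the example.com URLs are made up for illustration; `CCBot` is the user agent Common Crawl identifies itself with, but this is not Common Crawl's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from the thread).
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("CCBot", "https://example.com/private/page.html"),
        ("CCBot", "https://example.com/blog/post"),
        ("OtherBot", "https://example.com/private/page.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Note that the check is entirely voluntary: nothing on the server side enforces it, which is the point several other comments here make.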
LinuxBender about 2 years ago

I think the best way to answer the question would be to test it out. Have ChatGPT learn something from a URL that is forbidden by robots.txt.

FWIW, Google does not respect robots.txt in the way people think it does. It will still crawl and index a resource but will not publicly display it. Same for archive.org. I've verified that numerous times. Let archive.org index a thing that has always been forbidden by robots.txt and then after some time take the site down. Once robots.txt is no longer reachable, archive.org will start displaying content that was always forbidden *per robots.txt*. All bots follow the pirate code. *A bot will do what a bot >can< do...*

If a resource is meant to be less-than-public, it must be behind authentication that bots cannot bypass even with the assistance of a human using an addon. Translation addons or any addons using the cloud are an easy way to bypass authentication.
amluto about 2 years ago

Why would one expect ChatGPT to know the answer to this question? ChatGPT “knows” what it was trained on. The training data is unlikely to include a definitive answer to your question. And ChatGPT is not currently smart enough to do the kind of analysis that would determine the answer, nor is it likely to be able to do the kind of queries that would be needed to figure it out.
kbelder about 2 years ago

> "I asked ChatGPT..."

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

-Charles Babbage
notahacker about 2 years ago

I think the more important question is whether the notion of "indexing" for search purposes accurately describes what a corpus of data that never refers back to its source material does.

Ideally, you'd have an updated version of robots.txt which specified whether it allowed agents to use content in training LLMs or not, which different content publishers would set differently for different reasons.
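Standard robots.txt has no field for declaring a purpose such as "training", so the closest approximation is one rule group per crawler. The sketch below assumes, purely for illustration, that search and training crawlers identify themselves with distinct user agents; `ExampleSearchBot` and `ExampleTrainingBot` are hypothetical names, and `render_robots_txt` is a helper invented for this example.

```python
from urllib.robotparser import RobotFileParser


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per user agent.

    True allows the agent everywhere; False blocks it everywhere.
    """
    groups = []
    for agent, allowed in policy.items():
        # An empty Disallow permits everything; "Disallow: /" blocks everything.
        rule = "Disallow:" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(groups)


# Hypothetical publisher policy: allow search indexing, refuse LLM training.
policy = {"ExampleSearchBot": True, "ExampleTrainingBot": False}
robots_txt = render_robots_txt(policy)

# Check the generated file the same way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

if __name__ == "__main__":
    print(robots_txt)
    for agent in policy:
        print(agent, parser.can_fetch(agent, "https://example.com/article"))
```

This only works if training crawlers use a stable, published user agent and choose to honor the file, which is the gap the comment above is pointing at.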
gorbypark about 2 years ago

I don't know if they did their own crawling as well, but at least part of the training set for GPT-3 was Common Crawl data. You could look up whether Common Crawl respects robots.txt or not.
iKevinShah about 2 years ago

To give more context, I asked further:

Me: If you did not crawl - where did you learn about the data?

C: As an AI language model, my learning is based on a pre-existing dataset of text. This dataset is created by training me on a large corpus of text, which can include anything from books and articles to websites and social media posts. This training data is processed and used to develop a model that can understand natural language and generate responses to questions and queries.

It is important to note that my training data is not limited to any one source, and it may include information from a variety of sources and contexts. This means that I have been exposed to a wide range of information, including information from websites that I may not have crawled myself. However, the information that I provide is based solely on my training data and my ability to understand and interpret natural language.

Me: The said dataset - Who created it. and were crawlers used for it?

C: The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data.

In the process of collecting this data, web crawlers may have been used to collect text from websites. However, it's important to note that OpenAI and other responsible organizations adhere to ethical and legal standards when collecting data, and they respect the terms of service and other policies of websites that they crawl. This means that they take steps to ensure that they are not overloading servers with excessive requests or crawling sensitive information that should not be made public. Additionally, they make efforts to anonymize any data that is collected so that individual users or websites cannot be identified.

Me: It is good to know that you respect the service and policies but is it documented anywhere publicly?

C: Yes, OpenAI has publicly documented its policies and practices related to data collection, as well as its commitment to ethical and responsible AI development. These policies and practices are detailed on OpenAI's website, and the organization is transparent about its approach to data collection and AI development.

For example, OpenAI has published a set of ethical guidelines for AI development that emphasize transparency, fairness, and safety. The organization also participates in industry-wide initiatives focused on responsible AI development, such as the Partnership on AI and the Responsible AI License.

In addition, OpenAI has developed a number of tools and technologies designed to promote transparency and accountability in AI systems, including tools for interpreting and visualizing machine learning models, and technologies for measuring the potential impact of AI systems on society.

Overall, OpenAI is committed to responsible and ethical AI development, and the organization is transparent about its policies and practices related to data collection, AI development, and the responsible use of AI technologies.

Me: ok- can you share me the link for that?

C: Certainly! Here is the link to OpenAI's website where you can find more information about the organization's policies and practices related to AI development, data collection, and responsible use of AI technologies:

https://openai.com/about/ethics-safety/

This page provides an overview of OpenAI's approach to responsible AI development and links to more detailed information about the organization's policies and practices, as well as its tools and technologies for promoting transparency and accountability in AI systems.

Me: That link is 404 Error

C: I'm sorry about that! It looks like the link may have changed. Here's the updated link to OpenAI's website where you can find information about the organization's ethics and safety policies:

https://openai.com/about/mission-and-values/ethics-and-safety/

This page provides an overview of OpenAI's approach to ethical and responsible AI development, including its commitment to transparency, safety, and avoiding potential negative impacts of AI technologies. It also includes links to more detailed information about the organization's policies and practices.

Me: That is 404 as well

C: I apologize for the confusion! OpenAI has recently made some changes to its website, and some of the links may have changed. Here's the updated link to the page that provides an overview of OpenAI's mission, values, and ethics:

https://openai.com/about/mission/

From this page, you can access more information about the organization's values and ethics, including its commitment to transparency and responsible AI development. Additionally, you can explore the rest of the site to learn more about OpenAI's research, technology, and initiatives.

---

That is a 404 too, so I decided to ask the experts here :)