
Tell HN: We should start to add “ai.txt” as we do for “robots.txt”

562 points by Jeannen about 2 years ago
I started to add an ai.txt to my projects. The file is just a basic text file with some useful info about the website: what it is about, when it was published, the author, etc.

It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers: instead of using thousands of tokens to figure out what your website is about, they can do it with just a few hundred.
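A minimal sketch of what such a file could contain (the format here is hypothetical; the OP doesn't specify one):

    # /ai.txt -- hypothetical example
    Title: Example Blog
    Description: Personal blog about embedded systems and home automation.
    Author: Jane Doe
    Published: 2021-03-14
    Language: en
    License: CC BY-SA 4.0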

75 comments

samwillis about 2 years ago
Using robots.txt as a model for anything doesn't work. All a robots.txt is is a polite request to please follow the rules in it; there is no "legal" agreement to follow those rules, only a moral imperative.

Robots.txt has failed as a system; if it hadn't, we wouldn't have captchas or Cloudflare.

In the age of AI we need to better understand where copyright applies, and potentially need copyright reform to align legislation with what the public wants. We need test cases.

The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms and fewer restrictions on what you can do with publicly accessible content, the arguments are now quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...

In many ways an ai.txt would be worse than doing nothing, as it's a meaningless veneer that would be ignored, but pointed to as the answer.
qbasic_forever about 2 years ago
Your HTML already has semantic meta elements like author and description that you should be populating with info like that: https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/The_head_metadata_in_HTML
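For reference, this is standard head metadata, not a new convention:

    <head>
      <title>Example Blog</title>
      <meta charset="utf-8">
      <meta name="author" content="Jane Doe">
      <meta name="description" content="Personal blog about embedded systems and home automation.">
    </head>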
matsemann about 2 years ago
Reading the title I thought you meant the opposite: an ai.txt file that disallows AI from training on or using your data, similar to robots.txt (but for cases when you still want to be crawled, just not extrapolated).
nstj about 2 years ago
"Google Search works hard to understand the content of a page. You can help us by providing explicit clues about the meaning of a page to Google by including structured data on the page." [0]

[0]: https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data
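A small example of that structured data as JSON-LD embedded in a page (schema.org vocabulary; the values are made up):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "An example article",
      "author": { "@type": "Person", "name": "Jane Doe" },
      "datePublished": "2021-03-14"
    }
    </script>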
nforgerit about 2 years ago
Why are we so defensive about human-created content vs robot-created content? Do we really need to feel frightened by some GPT?

While the output of AI is astonishing by itself, is it really creating meaningful content en masse? I see myself relying more and more on human-curated content, because the typical commercialized use cases of AI-generated stuff (product descriptions, corp blogs, SEO landing pages, etc.) all read like meaningless blabber, to me at least.

Whenever I see some cool techbro boasting how he created his "SEO factory" using ChatGPT, I can't help but think that the poor guy is shitting where he eats without even realizing it. Take Google with their Search and Ads: over the last decade they brought down the overall quality of web content so much that I'm completely fed up using it, because with 99% probability I'll land on some meaningless SEO page.

From what I can perceive with things like HN, Mastodon, etc., it feels more like a rejuvenation of the human-centric, brand-trusted Web. And by that I mean: dear crawler, just use my content. Maybe you can do something good with it, maybe not. But chances are low it's going to replace me in any way; rather, it may improve my content. It only leads to a downward spiral if we stick with the commercial thinking of the past (more cheap content, more followers, more ads); if we instead switched to subscription models, individuals wouldn't get rich, but we'd have a great ecosystem of ideas and content again.
TechBro8615 about 2 years ago
If AI is using training data from your site, presumably it got that data by crawling it. So either it's already respecting robots.txt, in which case ai.txt would be redundant, or it's ignoring it, in which case there's no reason to expect it would respect ai.txt any more than it did robots.txt.
jedberg about 2 years ago
> it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.

Why would the crawler trust you to be accurate instead of just figuring it out for itself?

Besides, they want to hoover up all the data for their training set anyway.
hombre_fatal about 2 years ago
What problem is this solving? And why would anyone trust your description of your own site instead of just looking at your homepage? This is the same reason other self-description directives failed and why search engines just parse your content for themselves, something LLMs have no trouble with.

Why would I make a request to your low-trust self-description when I can make one to your homepage?
aww_dang about 2 years ago
Most of what you listed is already covered by existing meta tags and structured data.

https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/The_head_metadata_in_HTML#adding_an_author_and_description

https://schema.org/author

https://developers.google.com/search/blog/2013/08/relauthor-frequently-asked-advanced#:~:text=rel%3Dauthor%20helps%20individuals%20(authors,completely%20independent%20of%20one%20another
jeroenhd about 2 years ago
If AI needs explicit information and context, surely the focus should be on improving its context recognition rather than trying to fix that by inserting even more training data.

Regardless, I do agree that something like a robots.txt for AI could be very useful. I'd like my website to be excluded from most AI projects, and some kind of standardized way to communicate this preference would be nice, although I realize most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations. It's the idea that matters, really.

If I can use an ai.txt to convince the crawlers that my website contains illegal hardcore terrorist pornography to get it excluded from the datasets, that's another way to accomplish this, I suppose.
nashashmi about 2 years ago
What you have described is akin to what meta tags are for. Do we need another method at the domain or subdomain level? Plus, robots.txt and the like are limited to domain and subdomain managers.

ai.txt is useful, but I am not sure we have nailed down what it can be used for. One use is to tell AI not to train on the content found within, because it could itself be an AI generation.
kriro about 2 years ago
I'm curious what the legal ramifications would be of adding "this code is not to be used for any ML algorithms; failure to adhere to this will result in a fine of at least one million dollars" (in smarter wording) to a software license. Seems like a dumb idea / not enforceable, but maybe someone with software licensing knowledge can chime in.
sph about 2 years ago

    # cat > /var/www/.well-known/ai.txt
    Disallow: *
    ^D
    # systemctl restart apache2

Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work.

Please share with me ideas, links and further reading about adversarial anti-AI countermeasures.

EDIT: I've made an Ask HN for this: https://news.ycombinator.com/item?id=35888849
ciex about 2 years ago
If any kind of common URL is established, it should not be served from the root but from a location like `/.well-known/{ai,robots,meta,whatever}.txt`, in order not to clobber the root namespace.
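A minimal sketch of how a crawler could check the well-known location first and fall back to the root (Python standard library only; the ai.txt name and its contents remain hypothetical):

    import urllib.request
    from urllib.error import HTTPError, URLError

    def fetch_ai_txt(origin: str):
        """Return the site's ai.txt, preferring the /.well-known/ location."""
        for path in ("/.well-known/ai.txt", "/ai.txt"):
            try:
                with urllib.request.urlopen(origin + path, timeout=10) as resp:
                    return resp.read().decode("utf-8", errors="replace")
            except (HTTPError, URLError):
                continue  # try the next candidate location
        return None

    print(fetch_ai_txt("https://example.com"))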
winddude about 2 years ago
> It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.

How do you differentiate an AI crawler from a normal crawler? Almost all of the LLMs are trained on Common Crawl, and the concept of LLMs didn't even exist when CC started. What about a crawler that builds a search database whose results are fed into an LLM as context? Or middleware that fetches data in real time?

Honestly, it's a terrible idea, and robots.txt can cover the use cases. But it's still pretty ineffective, because it's more a set of suggestions than rules that must be followed.
westurner about 2 years ago
security.txt https://github.com/securitytxt/security-txt :

> security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.

Carbon.txt: https://github.com/thegreenwebfoundation/carbon.txt :

> A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.

"Work out how to make it discoverable - well-known, TXT records or root domains" https://github.com/thegreenwebfoundation/carbon.txt/issues/3#issuecomment-918656777 re: JSON-LD instead of txt, signed records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js).

SPDX is a standard for specifying software licenses (and now SBOMs, Software Bills of Materials, too): https://en.wikipedia.org/wiki/Software_Package_Data_Exchange

It would be transparent to disclose the SBOM in AI.txt or elsewhere.

How many parsers should be necessary for https://schema.org/CreativeWork https://schema.org/license metadata for resources with (Linked Data) URIs?
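For comparison, a minimal security.txt per RFC 9116, served from /.well-known/security.txt (Contact and Expires are the required fields):

    Contact: mailto:security@example.com
    Expires: 2026-12-31T23:59:59Z
    Preferred-Languages: en
    Policy: https://example.com/security-policy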
kordlessagain about 2 years ago
robots.txt is for all crawlers, so there's no need for another file. robots.txt supports comments using # and ideally has a link to the sitemap, which would tell any robot crawler where the important bits live on the site.

Putting a good comment at the top of robots.txt would be just as good as any other solution, given it could serve as a type of prompt template for processing the data on the site it represents.
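What that could look like in practice: the # comments and the Sitemap directive are standard robots.txt; using the comment header as an AI-readable summary is the commenter's suggestion:

    # Example Blog: a personal blog about embedded systems, by Jane Doe.
    # Crawlers welcome; the important content lives under /blog/.
    User-agent: *
    Disallow: /drafts/
    Sitemap: https://example.com/sitemap.xml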
TheRealPomax about 2 years ago
Feels like a setup for "and then we can blame people for not having an ai.txt when we rip their entire back catalog".
phkahler about 2 years ago
Isn't an AI a robot? Even if we do this, it should be in robots.txt.
mxuribe about 2 years ago
@Jeannen I really like the thinking here... But instead of ai.txt - since the intent is not to block but rather to inform AI models (or any other automaton) - my reflex is to suggest something more general like readme.txt. But then I thought, since it's really more about metadata, as others have stated, there might already be existing standards, or at least common behaviors that could become standardized. For example, someone noted security.txt; there's also the humans.txt approach (see https://humanstxt.org/), and of course there are web manifest files (see https://developer.mozilla.org/en-US/docs/Web/Manifest), etc. You might want to review existing approaches and maybe augment them, or see if any of those make sense (or not)?
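For reference, humans.txt is deliberately free-form; the convention sketched at humanstxt.org looks roughly like this:

    /* TEAM */
      Author: Jane Doe
      Contact: jane [at] example.com
      Location: Oslo, Norway

    /* SITE */
      Last update: 2023/05/10
      Standards: HTML5, CSS3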
arwineap about 2 years ago
What if we created a new access.txt which all user agents would use to get access to resources?

access.txt would return an individual access key for the user agent, like a session, and the user agent could only crawl using that access key.

This would mean we could standardize session starts with rate limits. A regular user is unlikely to hit the user rate limits, but bots would get rocked by rate limiting.

Great. Now authorized crawlers - Bing, Google, etc. - all use PKI so that they can sign the request to access.txt to get their access key. If the access.txt request is signed by a known crawler, the rate limits can be loosened to levels that a crawler will enjoy.

This would allow users / browsers to use normal access patterns without any issue, but crawlers would have to request elevated rate limits to perform their tasks. Crawlers and AI alike could be allowed or disallowed by the service owners, which is really what everyone wanted from robots.txt in the first place.

One issue I see with this already is that it solidifies the existing search engines as the market leaders.
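A rough sketch of the crawler side of that proposal, assuming Ed25519 signatures; access.txt, the header name, and the request shape are all invented here (uses the cryptography package):

    import base64
    import urllib.request
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Long-lived crawler identity key; the matching public key would be
    # registered with site owners (or a registry) out of band.
    identity_key = Ed25519PrivateKey.generate()

    def request_access_key(origin: str) -> str:
        """Sign the access.txt request so the server can grant crawler-tier limits."""
        signature = base64.b64encode(
            identity_key.sign(f"GET /access.txt {origin}".encode())
        ).decode()
        req = urllib.request.Request(
            origin + "/access.txt",
            headers={"X-Crawler-Signature": signature},  # hypothetical header
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode()  # the per-session access key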
h1fra about 2 years ago
Something like JSON-LD? It should cover most of your needs and can also be used by actual search engines.

e.g.: https://developers.google.com/search/docs/appearance/structured-data/article?hl=fr
dingle_thunk about 2 years ago
Isn't an AI a robot?
zzzeek about 2 years ago
I'm ready to put an ai.txt right on my site:

    Kirk: Everything Harry tells you is a lie. Remember that. Everything Harry tells you is a lie.
    Harry: Listen to this carefully, Norman. I am lying.
mrighele about 2 years ago
I would prefer a more generic "license.txt", i.e. a standard, sanctioned way of telling the user agent that the resources under a certain location are provided with a specific license. Maybe a picture is public domain; maybe it is copyrighted but freely distributable; or maybe you cannot train AI on it. Same for code, text, articles, etc. The difficult part would be making it formal enough that it can easily be consumed by robots.

With the current situation you either assume that everything is unusable, or you just don't care and crawl everything you can reach.
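A sketch of such a per-path license map (the file format and the no-AI-training flag are hypothetical; the license names are SPDX identifiers):

    # path prefix        license         extra terms
    /images/photos/      CC0-1.0
    /images/icons/       CC-BY-4.0
    /blog/               CC-BY-SA-4.0    no-ai-training
    /src/                MIT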
waffletower about 2 years ago
Attempts to muster and legitimize the ownership, squandering and sequestration of The Commons are growing rampant after the recent successes of generative AI. They are a tragic and misguided attempt to lesion, fragment and own the very consistency of The Collective Mind. Individuals and groups already have fairly absolute authority over their information property - simply choose not to release it to The Commons. If you do not want people to see, sit or sleep on your couch, please keep it locked inside your home.
jasfi about 2 years ago
I proposed META tags for the same reason. I don't think this is going to happen, though.
chunk_waffle about 2 years ago
> some useful info about the website like what it is about, when was it published, the author, etc etc.

Aren't there already things in place for that info (e.g. meta tags)?
kklisura about 2 years ago
Can we start changing our licenses to prohibit usage of a project for training AI systems?
throw9away6 about 2 years ago
Why would anyone want AI to train on and monetize their content? If there were a way to block AI from stealing content, most people would opt to block it.
renewiltord about 2 years ago
This is the wrong model, IMHO. Humans can figure out a website; we only tire. An AI system does not tire, but can do the same thing.

Additionally, any cooperative attempt won't work, because humans will attempt to misrepresent themselves.

No successful AI system will listen to someone's self-representation, because the AI system does not need proxies: it can act by simply acquiring all recorded observed behaviour.
julielit about 2 years ago
It is fair to give more information about the content exposed on a website, especially when it comes to partnering with AI systems. There is an international effort which covers exactly this, done under the auspices of the W3C: see https://www.w3.org/community/tdmrep/. It was developed to implement the Text & Data Mining + AI "opt-out" that is legal in Europe. It does not use robots.txt, because that file is about indexing a website and should stay focused on that. The information from website managers is contained in the /.well-known directory, in a JSON-LD file, which is much better structured than robots.txt. Why not adhere to an international effort rather than creating N fragmented initiatives?
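From the TDMRep drafts, the opt-out lives in a JSON-LD file at /.well-known/tdmrep.json and looks roughly like this (details may differ from the current spec):

    [
      {
        "location": "/blog/*",
        "tdm-reservation": 1,
        "tdm-policy": "https://example.com/tdm-policy.json"
      }
    ]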
javierluraschi about 2 years ago
Related, there is also https://datatxt.org
rchaud about 2 years ago
This is a well-intentioned thing to do, but I can't help feeling we are way past the point where something like this would even matter.

Do search robots even care if you have a "noindex" in your page `<head>`? Do websites care if your browser sends a Do Not Track request?
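The noindex in question is a single standard tag in the page head:

    <meta name="robots" content="noindex">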
deafpolygon about 2 years ago
Shouldn't AI respect robots.txt?
sn_master about 2 years ago
Wouldn't that make the job of spammers easier? They could create very low-quality websites with very high-quality (AI-generated?) ai.txt files that fool AI engines into trusting them more than other websites with better content.
menro1 about 2 years ago
I've started to play with the ai.txt metaphor, but pushing it closer to the semantic solution mentioned, focusing on content extraction and cleaning. Happy to share the example file if anyone is interested.
fredrik_skne_se about 2 years ago
If anyone wants to use my blog posts, they can contact me. I want to know my customer.

If you want to know about the copyright that applies to my work: https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-forfattningssamling/lag-1960729-om-upphovsratt-till-litterara-och_sfs-1960-729

Being in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, and you are not allowed to transform it.
keirabee about 2 years ago
I think most sites should (ideally) come with a text-only version. I know that's probably an extreme minority opinion, but between console-based browsers, screen readers, people with crappy devices, and people with miniature devices, at the very least having some kind of 'about this site' document would be helpful for anyone. There seems to be overlap between that need and this one, possibly. Then again, having it in some format like JSON (or XML) might also be more 'accessible' to machines (and to certain devices).
dmcq2 about 2 years ago
Just today I removed some disallow directives from robots.txt files and put noindex metas on the pages instead, as Google recommends. robots.txt doesn't really have much use nowadays.

As to copyright: yes, I agree the Mickey Mouse copyright law has been extended far too long and should be about thirty years. On the other hand, I think trademarks should not be anywhere near so easily liable to be lost, even if people do use the term generically. Disney should still be able to make new Mickey Mouse cartoons and be defended from others making them.
bryanrasmussen about 2 years ago
Aside from the other comments here: robots.txt does work to some extent, because it tells the crawler something that might be useful for the crawler to know. If you have blocked it from crawling part of your site, it might actually benefit the crawler to follow that restriction (to be a good citizen), because if it doesn't, you might block it after seeing its user agent show up in a part of the site it shouldn't.

ai.txt doesn't have this feedback to the AI to improve it. Also, it seems likely users might have reason to lie.
caturopath about 2 years ago
> The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.

How does this differ from what would be useful in humans.txt?
Goofy_Coyote about 2 years ago
Glad to see this here; lots of great points. I'm working on a spec for this specific use case, and reading the comments here has pointed out a few flaws in my model already.
sebastianconcpt about 2 years ago
It's impossible.

The problem is that such an ai.txt would be a one-dimensional opinion based on what? On the way the site describes itself - a self-referencing source.

But the AIs reading it are invariably going to be trained with different worldviews, and they will summarize and express opinions biased by those worldviews. It goes even deeper: every worldview can't help but belong to one ideology or another.

So who is aligned with truth now? The author? AI1? AI2? AI3? ... AIN?

We're in such a mess.
tlrobinson about 2 years ago
If there's one thing LLMs are pretty good at, it's summarizing content. Shouldn't your website just have an "About" page with this information that humans can read too?
__w1kke___ about 2 years ago
Put a blockchain wallet address (or even multiple, on different blockchains) in the ai.txt to collect your share of what the AI makes from your data and website. This is a fair way to solve the attribution problem. As with robots.txt, this is not hard enforcement, but a way for responsible AI to differentiate itself from the rest.
Animats about 2 years ago
At this point, all the good content has been sucked into LLM training sets. Other than a need to keep up with current events, there's no point in crawling more of the web to get training data.

There's a downside to dumping vast amounts of crap content into an LLM training set: the training method has no notion of data quality.
jruohonen about 2 years ago
A better idea along the same lines: RFC 5785.
escape_goat about 2 years ago
I don't think that it would be wise for anyone to rely on such practices. Even with the best of intentions, obsolescence and unintentional misdirection are strong possibilities. Considering normative intentions, it is an invitation for "optimization" attempts by websites presenting contested information.
annoyingnoob about 2 years ago
Do we need more features that are generally ignored? What has robots.txt gotten us? What has Do Not Track gotten us?
theandrewbailey about 2 years ago
Can you give a live example? What is in this ai.txt that isn't in an about page that almost every site has?
hosh about 2 years ago
Although, there is such a thing as the Semantic Web, where such information can be embedded within a page.
mirkodrummer about 2 years ago
I don't understand why ending up in a training dataset would be great. What's the point of having a website at all if users find what they've been looking for on another UI that was trained on your content and isn't your website?
lofaszvanitt about 2 years ago
AI belongs to governments, not trillion-dollar companies. Sorry, but we have to wrestle the AI thing out of their arms. It's working on people's output, so it should be free. Time to nuke the big ones out of the sector.
nottorp about 2 years ago
It will work exactly as well as robots.txt and the Do Not Track flag.
rhacker about 2 years ago
We should piss off Google and standardize around chatgpt.txt
Jaxan about 2 years ago
Why put this in ai.txt? It sounds useful to humans too! Maybe just put "what the site is about" on the homepage, so that everyone benefits.
moimikey about 2 years ago
What differentiates this from https://humanstxt.org/?
pnemonic about 2 years ago
Excuse me, we prefer the term "android".
runamok about 2 years ago
And ai.txt should have a mechanism for micro (or not-so-micro) payments. Please deposit 0.03 X coin into this account to crawl the site.
lynx23 about 2 years ago
robots.txt was a performance hack; it never felt like an audience filter. As sad as it might sound, hoping to filter publicly reachable content seems a bit naive in my book. If you don't want your stuff learnt by an AI, you'd better not publish it. Everything a human can read, an AI eventually will.
undersuit about 2 years ago
The AI doesn't have to follow ai.txt, but it appreciates the effort you put into classifying data for it.
2OEH8eoCRo0 about 2 years ago
Why? So that they can both be ignored?
noizejoy about 2 years ago
This makes about as much sense to me as the old "keywords" HTML meta tag.

It will be gamed.
kristianpaul about 2 years ago
Rate limits and captchas instead?
anon223345 about 2 years ago
robots.txt is mostly ignored, btw.
rzr about 2 years ago
What's next? humans.txt?
AndyMcConachie about 2 years ago
Semantic web for robots?
acdw about 2 years ago

    $ cat ai.txt
    no
    $
counterpartyrsk about 2 years ago
AI is a robot, no?
AndrewKemendo about 2 years ago
Feels redundant.
villgax about 2 years ago
We should add spurious HTML text instead.
anaclumos about 2 years ago
Some interesting studies on this I've done: https://cho.sh/r/F9F706

Project AIs.txt is a mental model of a machine learning permission system. Intuitively, question this: what if we could make a human-readable file that declines machine learning (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.

    User-agent: OpenAI
    Disallow: /some-proprietary-codebase/

    User-agent: Facebook
    Disallow: /no-way-mark/

    User-agent: Copilot
    Disallow: /expensive-code/

    Sitemap: /public/sitemap.xml
    Sourcemap: /src/source.js.map
    License: MIT

    # SOME LONG LEGAL STATEMENTS HERE

Key issues:

Would it be legally binding? For now, no. It would be a polite way to mark my preference to opt out of such data mining. It's closer to the "Ask BigTechs Not to Track" option than a legal license. Technically, Apple's App Tracking Transparency does not ban all tracking activity; it never can.

Why not LICENSE or COPYING.txt? Both are mainly written in human language and cannot provide granular scraping permissions depending on the collector. Also, GitHub Copilot ignores LICENSE and COPYING.txt, claiming we consented to Copilot using our code for machine learning by signing up and pushing code to GitHub. We could expand the LICENSE system to include terms for machine learning use, but that would create even more edge cases and chaotic licensing systems.

Does machine learning use of copyrighted works require a license? This question is still under debate. Opt-out should be the default if it requires a license, making such a license system meaningless. If it doesn't require a license, then which company would respect the license system, given that it is not legally binding?

Is robots.txt legally binding? No. Even if you scrape the web in ways prohibited by robots.txt, it is not against the law. See hiQ Labs, Inc. v. LinkedIn Corporation. robots.txt cannot make fair use illegal.

Any industry trends? W3C has been working on robots.txt for machine learning, aligning with EU Copyright Directives:

> The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights.

See w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group.

Can we even draw the line? No. One could reasonably argue that AI is doing the same as humans, only much better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also have the same level of awareness to respect it and prevent plagiarism. Otherwise, they must bear legal duties.

Maybe it can benefit AI companies too... by excluding all hacky code and opting in only best-practice code. If implemented correctly, it could work as an effective data sanitation system.
datavirtue about 2 years ago
Why?