I started to add an ai.txt to my projects. The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.<p>It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.
Using robots.txt as a model for anything doesn't work. All a robots.txt is is a polite request to please follow the rules in it, there is no "legal" agreement to follow those rules, only a moral imperative.<p>Robots.txt has failed as a system, if it hadn't we wouldn't have captchas or Cloudflare.<p>In the age of AI we need to better understand where copyright applies to it, and potentially need reform of copyright to align legislation with what the public wants. We need test cases.<p>The thing I somewhat struggle with is that after 20-30 years of calls for shorter copyright terms, lesser restrictions on content you access publicly, and what you can do with it, we are now in the situation where the arguments are quickly leaning the other way. "We" now want stricter copyright law when it comes to AI, but at the same time shorter copyright duration...<p>In many ways an ai.txt would be worse than doing nothing as it's a meaningless veneer that would be ignored, but pointed to as the answer.
Your HTML already has semantic meta elements like author and description you should be populating with info like that: <a href="https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/The_head_metadata_in_HTML" rel="nofollow">https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...</a>
Reading the title I thought you meant the opposite.<p>Aka, an ai.txt file that disallow ai to train or use your data similar to robots.txt (but for cases when you still want to be crawled, just not extrapolated)
“Google Search works hard to understand the content of a page. You can help us by providing explicit clues about the meaning of a page to Google by including structured data on the page.”[0]<p>[0]: <a href="https://developers.google.com/search/docs/appearance/structured-data/intro-structured-data" rel="nofollow">https://developers.google.com/search/docs/appearance/structu...</a>
Why are we so defensive concerning human created content vs robot created content? Do we really need to feel frightened by some gpt?<p>Whilst the output of AI is astonishing by itself, is it really creating meaningful content en masse? I see myself relying more and more on human-curated content because typical commercialized use cases of AI generated stuff (product descriptions, corp blogs, SEO landing pages, etc.) all read like meaningless blabber, to me at least.<p>Whenever I see some cool techbro boasting how he created his "SEO factory" using ChatGPT, I can't help but think that the poor guy is shitting where he eats without even realizing it.
Take Google with their Search and Ads; over the last decade they managed to bring down overall quality of web content that much, that I'm just completely fed up using it because by 99% chance I'll land on some meaningless SEO page.<p>From what I can perceive with things like HN, Mastodon, etc. it feels more like a rejuvenation of the human centric brand trusted Web. And by that I mean: Dear crawler, just use my content. Maybe you can do something good with it, maybe not. But chances are low, it's gonna replace me in any way but rather improve my content. It only leads to a downward spiral if we stick with the past of commercial thinking (more cheap content, more followers, more ads); if we'd instead switch to subscription models individuals won't get rich but we'd have a great ecosystem of ideas and content again.
If AI is using training data from your site, presumably it got that data by crawling it. So either it's already respecting robots.txt, in which case ai.txt would be redundant, or it's ignoring it, in which case there's no reason to expect it would respect ai.txt any more than it did robots.txt.
> it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.<p>Why would the crawler trust you to be accurate instead of just figuring it out for itself?<p>Besides, they want to hoover up all the data for their training set anyway.
What problem is this solving? Also why would anyone trust your description of your own site instead of just looking at your homepage? This is the same reason why other self description directives failed and why search engines just parse your content for themselves, something LLMs have no trouble with.<p>Why would I make a request to your low trust self description when I can make one to your homepage?
Most of what you listed is already covered by existing meta tags and structured data.<p><a href="https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduction_to_HTML/The_head_metadata_in_HTML#adding_an_author_and_description" rel="nofollow">https://developer.mozilla.org/en-US/docs/Learn/HTML/Introduc...</a><p><a href="https://schema.org/author" rel="nofollow">https://schema.org/author</a><p><a href="https://developers.google.com/search/blog/2013/08/relauthor-frequently-asked-advanced#:~:text=rel%3Dauthor%20helps%20individuals%20(authors,completely%20independent%20of%20one%20another" rel="nofollow">https://developers.google.com/search/blog/2013/08/relauthor-...</a>.
If AI needs explicit information and context, surely it should focus on improving its context recognition rather than trying to fix that by inserting even more training data.<p>Regardless, I do agree that something like a robots.txt for AI can be very useful. I'd like my website to be excluded from most AI projects and some kind of standardized way to communicate this preference would be nice, although I realize most AI projects don't exactly care about things like the wishes of authors, copyright, or ethical considerations. It's the idea that matters, really.<p>If I can use an ai.txt to convince the crawlers that my website contains illegal hardcore terrorist pornography to get it excluded from the datasets, that's another way to accomplish this I suppose.
What you have described is something akin to what meta tags are for. Do we need another method at a domain or subdomain level? Plus, robots.txt, etc. is limited to domain and subdomain managers.<p>ai.txt is useful, but I am not sure we have nailed down what it can be used for. One use is to tell AI not to train on the content found within because it could be an AI generation.
I'm curious what the legal ramifications of adding "this code is not to be used for any ML algorithms, failure to adhere to this will result in a fine of at least one million dollars" (in smarter writing) to a software license would be. Seems like a dumb idea/not enforcable but maybe someone with software licensing knowledge can chime in.
<p><pre><code> # cat > /var/www/.well-known/ai.txt
Disallow: *
^D
# systemctl restart apache2
</code></pre>
Until then, I'm seriously considering prompt injection in my websites to disrupt the current generation of AI. Not sure if it would work.<p>Please share with me ideas, links and further reading about adversarial anti-AI countermeasures.<p>EDIT: I've made an Ask HN for this: <a href="https://news.ycombinator.com/item?id=35888849" rel="nofollow">https://news.ycombinator.com/item?id=35888849</a>
If any kind of common URL is established, it should not be served from root but a location like `/.well-known/{ai,robots,meta,whatever}.txt` in order not to clobber the root namespace.
> It can be great if the website somehow ends up in a training dataset (who knows), and it can be super helpful for AI website crawlers, instead of using thousands of tokens to know what your website is about, they can do it with just a few hundred.<p>How do you differentiate an AI crawler from a normal crawler? Almost all of the LLMs are trained on commoncrawl, which the concept of LLMs didn't even exist when CC started. What about a crawler that creates a search database, but's context is fed into a LLM as context? Or a middleware that fetches data in real time?<p>Honestly that's a terrible idea. and robots.txt can cover the use cases. But is still pretty ineffective, because it's more just a set of suggestions than rules that must be followed.
security.txt <a href="https://github.com/securitytxt/security-txt">https://github.com/securitytxt/security-txt</a> :<p>> <i>security.txt provides a way for websites to define security policies. The security.txt file sets clear guidelines for security researchers on how to report security issues. security.txt is the equivalent of robots.txt, but for security issues.</i><p>Carbon.txt: <a href="https://github.com/thegreenwebfoundation/carbon.txt">https://github.com/thegreenwebfoundation/carbon.txt</a> :<p>> <i>A proposed convention for website owners and digital service providers to demonstrate that their digital infrastructure runs on green electricity.</i><p>"Work out how to make it discoverable - well-known, TXT records or root domains" <a href="https://github.com/thegreenwebfoundation/carbon.txt/issues/3#issuecomment-918656777">https://github.com/thegreenwebfoundation/carbon.txt/issues/3...</a> re: JSON-LD instead of txt, <i>signed</i> records with W3C Verifiable Credentials (and blockcerts/cert-verifier-js)<p>SPDX is a standard for specifying software licenses (and now SBOMs Software Bill of Materials, too) <a href="https://en.wikipedia.org/wiki/Software_Package_Data_Exchange" rel="nofollow">https://en.wikipedia.org/wiki/Software_Package_Data_Exchange</a><p>It would be transparent to disclose the SBOM in AI.txt or elsewhere.<p>How many parsers should be necessary for <a href="https://schema.org/CreativeWork" rel="nofollow">https://schema.org/CreativeWork</a> <a href="https://schema.org/license" rel="nofollow">https://schema.org/license</a> metadata for resources with (Linked Data) URIs?
robots.txt is for all crawlers, so there's no need for another file? robots.txt supports comments using # and ideally has a link to the site map, which would tell any robot crawler where the important bits live on the site.<p>Putting a good comment at the top of robots.txt would be just as good as any other solution, given it could serve as a type of prompt template for processing the data on the site it represents.
@Jeannen I really like the thinking here...But instead of ai.txt - since the intent is not to block, but rather, to inform AI models (or any other presumably automaton) - my reflex is to suggest something more general like readme.txt. But, then i thought, well, since its really more about metadata, as others have stated, there might already be existing standards...Or, at least, common behaviors that could become standardized. For example, someone noted about security.txt, and i know there's the humans.txt approach (see <a href="https://humanstxt.org/" rel="nofollow">https://humanstxt.org/</a>), and of course there are web manifest files (see <a href="https://developer.mozilla.org/en-US/docs/Web/Manifest" rel="nofollow">https://developer.mozilla.org/en-US/docs/Web/Manifest</a>), etc. I wonder if you might want to consider reviewing existing approaches, and maybe augment them or see if any of thjose makese sense (or not)...?
What if we create a new access.txt which all user agents will use to get access to the resources.<p>access.txt will return an individual access key for the user agent like a session, and the user agent can only crawl using the access key<p>This would mean that we could standardize session starts with rate limits. Regular user is unlikely to hit the user rate limits, but bots would get rocked by rate limiting.<p>Great. Now authorized crawlers, bing, google, etc, all use PKI so that they can sign the request to access.txt to get their access key. If the access.txt request is signed with a known crawler the rate limits can be loosened to levels that a crawler will enjoy<p>This will allow users / browsers to use normal access patterns without any issue, but crawlers will have to request elevated rate limits to perform their tasks. Crawlers and AI alike could be allowed or disallowed by the service owners, which is really what everyone wanted from robots.txt in the first place<p>One issue I see with this already is that it solidifies the existing search engines as the market leaders
Something like JSON+LD ?
It should cover most of your needs and can also be used for actual search engine<p>e.g: <a href="https://developers.google.com/search/docs/appearance/structured-data/article?hl=fr" rel="nofollow">https://developers.google.com/search/docs/appearance/structu...</a>
I'm ready to put an ai.txt right on my site<p><pre><code> Kirk: Everything Harry tells you is a lie. Remember that. Everything Harry tells you is a lie.
Harry: Listen to this carefully, Norman. I am lying.</code></pre>
I would prefer a more generic "license.txt" i.e. a standard sanctioned way of telling the User Agent that the resource under a certain location is provided with a specific license. Maybe a picture is public domain, maybe is copyrighted but freely distributable, or maybe it is but you cannot train AI on it. Same for code, text, articles etc. The difficult part would be to make it formal enough so that it can easily consumed by robots.<p>With the current situation you either assume that everything is not usable, or you just not care and crawl everything that you can reach.
Attempts to muster and legitimize the ownership, squandering and sequestration of The Commons are growing rampant after the recent successes of generative AI. They are a tragic and misguided attempt to lesion, fragment and own the very consistency of The Collective Mind. Individuals and groups already have fairly absolute authority over their information property -- simply choose not to release it to The Commons. If you do not want people to see, sit or sleep on your couch, please keep it locked inside your home.
> some useful info about the website like what it is about, when was it published, the author, etc etc.<p>Aren't there already things in place for that info (e.g. meta tags?)
This is the wrong model imho. Humans can figure out a website. We only tire. An AI system does not. But can do the same thing.<p>Additionally, any cooperative attempt won't work because humans will attempt to misrepresent themselves.<p>No successful AI system will listen to someone's self representation because the AI system does not need proxies: it can act by simply acquiring all recorded observed behaviour.
It is fair to give more information about the information exposed on a website, especially when it comes to partnering with AI systems. There is an international effort which includes such information. It is done under the auspices of the W3C. See <a href="https://www.w3.org/community/tdmrep/" rel="nofollow">https://www.w3.org/community/tdmrep/</a>. It has been developed to implement the Text & Data Mining + AI "opt-out" that is legal in Europe. It does not use robots.txt because this one is about indexing a website and should stay focus on it. The information about website managers is contained in the /.well-known directory, in a JSON-LD file, which is much more well structured than robots.txt.
Why not adhere to an international effort rather than creating N fragmented initiatives?
This is a well-intentioned thing to do. But I can't help but feel that we are way past the point where something like this would even matter.<p>Do search robots even care if you have a "noindex" in your page `<head>`? Do websites care if your browser sends a Do Not Track request?
Wouldn't that make the job of spammers easier? They can create very low quality websites but with very high quality (AI Generated?) ai.txt that fools AI engines into trusting them more than other websites with better content.
I've started to play with the Ai.txt metaphor, but pushing it closer to the semantic solution mentioned, focusing on Content Extraction and Cleaning. Happy to share the file example if anyone is interested.
If anyone want to use my blog posts, they can contact me. I want to know my customer.<p>If you want to know about copyright that applies to my work: <a href="https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-forfattningssamling/lag-1960729-om-upphovsratt-till-litterara-och_sfs-1960-729" rel="nofollow">https://www.riksdagen.se/sv/dokument-lagar/dokument/svensk-f...</a><p>Beeing in the US does not shield you from my country's laws. You are not allowed to copy my work without my permission, you are not allowed to transform it.
I think really most sites should (ideally) come with a text-only version. I know that's probably an extreme minority opinion but between console-based browsers, screenreaders, people with crappy devices, people with miniature devices, at the very least just having some kind of 'about this site' document would be helpful for anyone. There seems to be overlap between that need and this, possibly.
Then again, having it in some format like json (or xml) might also be more 'accessible' to machines (and to certain devices).
I just today removed some disallow directives from robots.txt files and put in noidex metas on the pages instead like Google recommends. It doesn't really have much use nowadays.<p>As to copyright - yes I agree the Micky Mouse copyright law has been extended far too long and should be about thirty years. On the other hand I think trade marks should not be nowhere so easily liable to be lost even if people do use the term generally. Disney should still be able to make new Micky Mouse cartoons and be defended from others making them.
aside from the other comments here - robots.txt does work to some extent because it tells the crawler something it might be useful for the crawler to know - if you have blocked it from crawling part of yur site it might be actually beneficial to the crawler to follow that restriction (to be a good citizen) because if it doesn't you might block it by seeing a user agent showing up a part of the site it shouldn't.<p>AI.txt doesn't have this feedback to the AI to improve it. Also it seems likely users might have reason to lie.
> The file is just a basic text file with some useful info about the website like what it is about, when was it published, the author, etc etc.<p>How does this differ from what would be useful in humans.txt?
Glad to see this here, lots of great points here.
I'm working on a spec for this specific usecase, reading the comments here pointed out a few flaws in my model already.
It's impossible.<p>The problem is that such ai.txt would be an unidimensional opinion based on what? On the way the site describes itself. So a self-referencing source.<p>But the AIs reading it, are precisely going to <i>invariably</i> be trained with different world views that <i>will</i> summarize and express opinions biased by these worldviews. It's even deeper as every worldview can't help but belong to one ideology or another.<p>So who is aligned with truth now?<p>The author? AI1? AI2? AI3?...AIN?<p>We're in such a mess.
If there’s one thing LLMs are pretty good at it’s summarizing content. Shouldn’t your website just have an “About” page with this information that humans can read too?
Put a blockchain wallet address or even multiples on different blockchains in the ai.txt to collect your shares of what the AI makes from your data + website. This is a fair way to solve the attribution problem. Similar to the robots.txt file this is not a hard enforcement but a way how responsible AI can differentiate itself from the rest.
At this point, all the good content has been sucked into LLM training sets. Other than a need to keep up with current events, there's no point in crawling more of the web to get training data.<p>There's a downside to dumping vast amounts of crap content into an LLM training set. The training method has no notion of data quality.
I don't think that it would be wise for anyone to rely on such practices. Even with the best of intentions, obsolescence and unintentional misdirection are strong possibilities. Considering normative intentions, it is an invitation for "optimization" attempts by websites presenting contested information.
Why ending in a training dataset would be great I don’t understand. I mean what’s the point of having a website at all if the user find what they’ve been looking for on another UI that’s been trained with your content and that’s not your website?
AI belongs to governments, not over trillion companies. Sorry, but we have to wreste the AI thing out from their arms. It's working on people's output, so it should be free. Time to nuke the big ones out from the sector.
robots.txt was a performance-hack. It never felt like a audience-filter. As sad as it might sound, hoping for filtering on publicly reachable content seems a bit naiv in my book. If you want your stuff not learnt by an AI, you better not publish it. Everything a human can read, an AI eventually will.
Some interesting studies on this I've done: <a href="https://cho.sh/r/F9F706" rel="nofollow">https://cho.sh/r/F9F706</a><p>Project
AIs.txt is a mental model of a machine learning permission system. Intuitively, question this: what if we could make a human-readable file that declines machine learning (a.k.a. Copilot use)? It's like robots.txt, but for Copilot.<p>User-agent: OpenAI
Disallow: /some-proprietary-codebase/<p>User-agent: Facebook
Disallow: /no-way-mark/<p>User-agent: Copilot
Disallow: /expensive-code/<p>Sitemap: /public/sitemap.xml
Sourcemap: /src/source.js.map
License: MIT<p># SOME LONG LEGAL STATEMENTS HERE<p>Key Issues
Would it be legally binding?
For now, no. It would be a polite way to mark my preference to opt-out of such data mining. It's closer to the Ask BigTechs Not to Track option rather than a legal license. Technically, Apple's App Tracking Transparency does not ban all tracking activity; it never can.<p>254AFC.png<p>Why not LICENSE or COPYING.txt?
Both are mainly written in human language and cannot provide granular scraping permissions depending on the collector. Also, GitHub Copilot ignores LICENSE or COPYING.txt, claiming we consented to Copilot using our codes for machine learning by signing up and pushing code to GitHub, We may expand the LICENSE system to include the terms for machine learning use, but that would even more edge case and chaotic licensing systems.<p>Does machine learning purposes of copyrighted works require a license?
This question is still under debate. Opt-out should be the default if it requires a license, making such a license system meaningless. If it doesn't require a license, then which company would respect the license system, given that it is not legally binding?<p>Is robots.txt legally binding?
No. Even if you scrape the web prohibited under robots.txt, it is not against the law. See HIQ LABS, INC., Plaintiff-Appellee, v. LINKEDIN CORPORATION, Defendant-Appellant.. robots.txt cannot make fair use illegal.<p>Any industry trends?
W3 has been working on robots.txt for machine learning, aligning with EU Copyright Directives.<p>The goal of this Community Group is to facilitate TDM in Europe and elsewhere by specifying a simple and practical machine-readable solution capable of expressing the reservation of TDM rights. w3c/tdm-reservation-protocol: Repository of the Text and Data Mining Reservation Protocol Community Group<p>Can we even draw the line?
No. One could reasonably argue that AI is doing the same as humans, much better and more efficiently. However, that claim goes against the fundamentals of intellectual property. If any IP is legally protected, machine-generated code must also have the same level of awareness system to respect it and prevent any plagiarism. Otherwise, they must bear legal duties.<p>Maybe it can benefit AI companies too
... by excluding all hacky codes and only opting for best-practice codes. If implemented correctly, it can work as an effective data sanitation system.