From the beginning, it's seemed completely intuitive to me that training a computer made of sand on publicly available content and then generating art later should be fair use, so long as it's fair use to train the meat computer in your head on the same content and then use it to generate art later. There's no meaningful difference to me as far as the ethics of the act are concerned.
No he doesn't.<p>> I think that with respect to content that’s already on the open web, the social contract of that content since the ‘90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been “freeware,” if you like, that’s been the understanding.<p>> There’s a separate category where a website, or a publisher, or a news organization had explicitly said ‘do not scrape or crawl me for any other reason than indexing me so that other people can find this content.’ That’s a grey area, and I think it’s going to work its way through the courts.
I'll bet they don't consider the Windows and Office source code fair game for arbitrary reuse provided the other party found the copy on the web. Even if the person found the copy on GitHub.
Isn't having this discussion at all stupidly letting them control the goalposts? They have already gone far beyond this, thinking that everything someone does on their own personal computer in their own home, without the slightest bit of consent, is going to be slurped up and recorded in case they want to query it someday.<p>This is like arguing that this guy who just murdered someone 10 minutes ago should actually be able to steal the candy from this child, since the child put it down on the park bench.
The more I read about this guy the more I get the feeling that he is an unscrupulous individual.<p>robots.txt is a "grey idea" to him, instead of being a directive to keep moving? Wow.
He compares fear of "AI" to fear of calculators. But "AI" cannot do math. Calculators do not "hallucinate". They are not correct "80%" of the time. They are correct 100% of the time. We know how they work. IIRC, in the 1970s someone at Bell Labs wrote a UNIX program that could generate fake academic papers. It might be a fun gag, but does it have much practical utility? No matter how "real" the papers might appear, or even if they are correct "80%" of the time, it is not an "invention", and it is certainly not comparable to a calculator.
Will this make people who earn money indirectly through their content less motivated to publish it on the Web?
This might be arguable.<p>Maybe there should be a similar amount of openness in publishing the content used for training commercial models.<p>The copyright owner should have the privilege to ask for that content to be removed from training. This may also allow individual authors to gain their share with their own Advanced RAG applications, which are specially focussed on the content they own and have also published on the web.
One thing is a robots.txt policy, meant mostly for search crawlers.<p>Another thing is the copyright of the content, terms of use policies, etc.<p>Abiding by a robots.txt policy doesn't make you immune to copyright, terms of service, law in various jurisdictions, etc. If you think that, you are probably a kleptomaniac.<p>Just create a robots.txt with "User-Agent: one billion asterisks" so that the crawlers die when parsing it.
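For anyone who hasn't looked at how a robots.txt policy is actually consulted, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot names (`ExampleSearchBot`, `ExampleTrainingBot`), and the `may_fetch` helper are all made up for illustration; they are not taken from any real site or crawler. It only shows what a well-behaved crawler would check before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration: one named
# search crawler is allowed everywhere, every other agent is refused.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print("ExampleSearchBot:", may_fetch("ExampleSearchBot", page))
    print("ExampleTrainingBot:", may_fetch("ExampleTrainingBot", page))
```

Note that the check is purely voluntary on the crawler's side: nothing here enforces anything, and passing it says nothing about copyright or terms of service, which is the point made above.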
It seems obvious to me that there is no such thing as AI without publicly training on the open web, and that any kind of licensing is an impossible feat.<p>Programs from my youth (Daria, Captain N) had licensed music for their broadcast, and that’s all because what else was ever going to be done? 20 years later, streaming with the music intact is an impossibility because the kind of money necessary to license <i>all</i> of it was too much. And you have to make deals with dozens of companies.<p>Multiply that by several orders of magnitude and you start to see the scope of the problem.
Part of the problem here is that the web has gone through lots of change as to what it is and how people understand it.<p>Some people think of it as billboards posted on the highway. Some think it’s a bulletin board. Some think it’s a newspaper. A television, a “zine”, a diary, graffiti. It has been all of these things, and is and isn’t. And people who publish are really bad at explicitly stating which one they are. But they expect you to know.
So we've now learned that copyright is determined by communications protocol. If you're using torrents it's copyright infringement, if it's the web then it's public domain.
Hmm, hear me out: go to a public website and add black space below any video or picture, filled with random adjectives that are your satirical review of that piece of art, then feed those into the AI model and tell it to ignore any text.
This is nothing but performative clickbait by the Verge.<p>It is classified as fair use; the term is transformative use, where those using the content are training models (their intention), if anyone wishes to Google it.<p>The end.
The open web's ethos since its inception in the 1990s has been one of unrestricted access and fair use. Content published openly online inherently invites broad consumption, reproduction, and creative reuse by the public. This is not merely custom, but a fundamental aspect of fair use doctrine as applied to the digital realm.<p>The four factors of fair use - purpose of use, nature of the copyrighted work, amount used, and effect on the market - overwhelmingly favor allowing free use of openly published web content. The transformative nature of most reuses, the public availability of the original works, the necessity of using entire works in many cases, and the lack of a traditional market for such content all support this interpretation.<p>This longstanding practice has been the catalyst for unprecedented innovation and information dissemination. It represents a tacit social contract between content creators and users, establishing a de facto "freeware" model for open web content. Any attempt to retroactively impose strict copyright limitations would not only stifle innovation but also contradict decades of established legal precedent and digital norms.<p>As a side note, I’m not certain that training necessarily involves “copying.”<p>Lastly, if anyone really thinks the Roberts Court is going to kneecap AI, you’re soft in the head.