I still find it (depressingly) hilarious how everybody sees this as a lawsuit about whether training on copyrighted content is legal.<p>Literally, the NYT claimed that OpenAI maintained a database of the NYT's works and would surface the content verbatim. This is not an AI issue, it's settled copyright law.
I like following the OpenAI vs. NYT case, as it's a great example of a controversial situation:<p>- OpenAI created their models by parsing the internet while disregarding copyrights, licenses, etc., or by looking for legal loopholes<p>- by doing that, OpenAI (alongside others) developed a new, progressive tool that is shaping the world and seems to be the next “internet”-like thing (impact-wise)<p>- NYT is not happy about that, as their content is their main asset<p>- less democratic countries can apply even less ethical practices for data mining, as copyright laws don't work there, so one might claim that it's a question of national defense, considering that AI is actively used in military tech these days<p>- while the ethical part is less controversial (imho, as I'm with NYT there), the legal one is more complicated: the laws might simply say nothing about this use case (think GPL vs. AGPL license), so the world might need new ones.<p>And so on...
Is anyone building a <i>public domain</i> repository / AI training ground for old newspapers? Anything before 1930 has no restrictions. Newspapers.com has pretty good content, but the interface and search are extremely lacking. Google News was abandoned a decade ago. This seems like something where AI could really help, for once. Not in training chatbots or whatever, but actually just providing great search for articles in books, newspapers, and magazines.
Would anyone here be able to explain to me where this money is going? Are the lawyers working for the New York Times really this expensive? If so these lawyers must be getting massive amounts of money...
NYT will lose:<p>Copyright only protects the actual text. LLMs have weights, not exact copies. In any case, the claim is that "if I put in some input and get copyrighted output," that is tantamount to copyright violation; but if I use a generative tool and generate copyrighted info, is it the tool's fault?<p>An LLM is a dump of effectively arbitrary numbers that, when hooked up to a command line, uses one of the world's most awful programming languages to evaluate and execute.<p>OpenAI at most broke a EULA or some technicality on copyright w.r.t. local ephemeral copies. What's the damage to the NYT though?
Are they paying the lawyers with government money? I'm seriously asking. Why is the government paying tens of millions of dollars a year to the New York Times? How can they still claim to be a news organization without having disclosed this? If the government is paying the NYT, then don't their productions belong in the public domain?<p><a href="https://x.com/stillgray/status/1887191056074350690" rel="nofollow">https://x.com/stillgray/status/1887191056074350690</a>
"OpenAI asserts that training AI models using publicly accessible content, including material from The New York Times, is protected under longstanding fair use principles."<p>Incredible.<p>The foundation of fair use is a transformative and non-consumptive use of copyrighted material.
My ideal solution would be to release everything the NYT has written in the past into the public domain, turn it over to archive.org, and dismantle the NYT so it’s no longer an issue in the future.