If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

Let's say I host these, offer some indexing on it, and rewrite articles. Something like: summarise all articles on US-UK relationships over the past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe just quite short snippets.

Is that fair use? IANAL, but it doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say, like VS Code.

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.

But then what do I know...
The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use".
However, OAI/MS should be able to fix this within the current paradigm: just learn to recognize and punish plagiarism via RLHF.

That said, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: what about a model can (or can't) be "transformative"?
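To make "punish plagiarism via RLHF" concrete: one hypothetical reward-shaping term would penalize n-gram overlap with the protected corpus. A minimal sketch; the function name, the choice of n, and the scaling are all mine, not any real pipeline's:

```python
def ngram_overlap_penalty(output: str, corpus_ngrams: set, n: int = 8) -> float:
    """Fraction of n-word sequences in `output` that appear verbatim in the
    protected corpus. Hypothetical: subtract this (scaled) from the RLHF
    reward to make verbatim recitation score badly."""
    words = output.split()
    if len(words) < n:
        return 0.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return sum(g in corpus_ngrams for g in grams) / len(grams)

# corpus_ngrams would be built once, offline, from the copyrighted subset
# of the training data: the set of all n-word tuples it contains.
```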
The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won't matter, for *so* many reasons. To enumerate a few:

1. Next gen LLMs will be trained exclusively on "synthetic"/public data. GPT-4V can *easily* whitewash its entire copyrighted training corpus to be unrecognizably distinct (say, reworded by 40%, authors/sources stripped, etc.). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide, all to appease rent-seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill-equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al. are coming along for the ride, whether they like it or not.
The lawsuit itself (which Ars Technica links to):

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

Page 30 onwards has some fairly clear examples of how ChatGPT holds an (internal) copy of copyrighted material which it will recite verbatim.

Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

As I see it, the closedness of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
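That "straightforward matching mechanism" could be as simple as an n-gram index over the protected subset of the training data; a rough sketch, where the trigger length N is an arbitrary choice of mine:

```python
N = 12  # verbatim word-run length that trips the filter; arbitrary choice

def build_index(copyrighted_docs) -> set:
    """Index every N-word sequence occurring in the protected documents."""
    index = set()
    for doc in copyrighted_docs:
        words = doc.split()
        index.update(tuple(words[i:i + N]) for i in range(len(words) - N + 1))
    return index

def should_block(candidate: str, index: set) -> bool:
    """True if the model output contains any N-word run found verbatim in
    the protected corpus -- i.e., the web interface should refuse or
    redact it before display."""
    words = candidate.split()
    return any(tuple(words[i:i + N]) in index
               for i in range(len(words) - N + 1))
```

This is exactly the kind of fix only a closed, hosted service can bolt on at the interface; it does nothing about what the weights themselves contain.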
People who think the examples in the lawsuit are "fair use" need to consider what that would mean. We're basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules ... that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it's not this case.
We developers like to pretend that LLMs are akin to humans, and that they've been using things like the NYTimes as educational material, the way humans do.

But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code into my own codebase, then claimed copy-paste is a natural process evolved over millions of years.

The fact that LLMs are so complicated, and we can't point to where the copied material lives, doesn't make it any less so.
Companies that have content all see dollar signs.

NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
NYT's perspective is going to look so stupid in the future, when we put LLMs into mechanical bodies with the ability to interact with the physical world and to learn/update their weights live. A win here would make it completely illegal for such a robot to read/watch/listen to any copyrighted material: no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.
I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.
this was predicted in the very influential epic 2014 video in 02004

https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)

36": 'however, the press as you know it has ceased to exist'

40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'

2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'

5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'

5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'

they didn't get the details exactly right, but overall the accuracy is astounding

however, that may be a hyperstition artifact in this timeline

https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)
> Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.

This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.
> “The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”

https://www.newyorker.com/books/page-turner/rethinking-the-luddites-in-the-age-of-ai
I see few people here bring this up, so let me:

The US constitution says the Congress shall have Power

> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

So Congress's power to make copyright and patent laws is predicated on the promotion of science and useful arts (I believe the latter actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress by copyright is, in my mind, unconstitutional.
If I create a news website where I write articles in the following way:

- Read 20 different news websites and their story on the same event/topic

- Wait an hour, grab a cup of coffee

- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head

- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites

- I will probably also write something that is not correct, or add something new, because, as I said, my memory is not the best

Is that fair use? Am I infringing on copyright?
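That 10-word rule is mechanically checkable, incidentally; a sketch using Python's difflib (the threshold is the one from the comment, everything else is illustrative):

```python
from difflib import SequenceMatcher

def longest_shared_run(mine: str, source: str) -> int:
    """Length, in words, of the longest verbatim word sequence shared
    between my article and one source article."""
    a, b = mine.split(), source.split()
    m = SequenceMatcher(None, a, b, autojunk=False)
    return m.find_longest_match(0, len(a), 0, len(b)).size

# The self-imposed rule, checked mechanically against every source:
# ok = all(longest_shared_run(my_article, s) <= 10 for s in sources)
```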
I think LLMs may really change the IP landscape.

Culturally we're taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it's not unreasonable to re-examine them from first principles.
Would be funny if the NY Times won this and all commercial LLMs were shut down.

Then LLMs would be distributed only via torrents, like most copyright-infringing media.
At some point, carrying 100-year-old copyright/patent law will become so onerous a burden on the pace of progress that its enforcement will be antihuman.
Fair use is something Wikipedians dance around a fair amount. It also meant I did a *lot* of reading about it.

It's a four-part test. Let's examine it thusly:

1. Transformative. Is it? It spits out informative text and opinion. The only "transformation" is that it's generative text. IMO that's a fail.

2. Nature of the work - it's being used commercially. Given it's being trained partially on editorial, that's creative enough that I think any judge would find it problematic. Fail on this criterion.

3. Amount. It looks like they trained the model on all of the NYT's articles. Oops, definite fail.

4. Effect on the market. Almost certainly negative for the NYT.

IMO, OpenAI cannot successfully claim fair use.
I read about this in the Times today (and am surprised that it wasn't on HN already).

My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.

I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions, is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.

I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.
Given that Harvard's President plagiarized her way into becoming President, how can we be sure that the NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?
Related. Others?

*NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT* - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (80 comments)

*The New York Times is suing OpenAI and Microsoft for copyright infringement* - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (837 comments)

*The Times Sues OpenAI and Microsoft Over A.I.'s Use of Copyrighted Work* - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)
> To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.

If the AI can recall the text verbatim then it's not at all the same. When we read, we are not able to reproduce the book from memory. Even if a human could memorise an entire book, it's not at all practical to reproduce the book that way. The current AIs are not learning "ideas"; they are learning sequences of words.
Probably has something to do with impending deals between the NYT and major companies, e.g.

[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html

[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models-news-publishers
Something I have wondered about LLMs and training data: the biggest content producers on the internet now have their world view and tone echoed disproportionately in the next big wave of technology. That is incredibly impactful (although admittedly I don't know how to turn it into a profit). Is there some unforeseen long-term consequence of removing the New York Times from training data, such that it won't be part of LLM corpora going forward?
If they don't let AIs be trained on as much data as possible, those AIs will be less "good" than ones trained without such constraints, as you will have in China or elsewhere, and people will mechanically start using the latter.

Unless they engage in massive IP and DNS banning, geolocation-based, forced upon all internet users and "external" users.
NYT wants to outlaw a math game created by calculating the probabilities of word groupings and of words following each other in NYT articles, along with a lot of other writings the NYT does not own. The players roll the dice, so to speak, by seeding an initial string of words, and whoever comes up with the most interesting paragraph wins. This paragraph may or may not look like NYT writing, which, in the larger scheme of the collected writings of humankind, isn't particularly unique. It doesn't even have to be true. Hallucinations are an expected outcome.

If an NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT, because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique or exclusive to the NYT.

Is the NYT claiming ownership of the weights in LLMs?
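The "math game" framing can be made literal with a toy bigram model (real LLMs are vastly more complex, but the dice-rolling is the same in spirit):

```python
import random
from collections import defaultdict

def train_bigrams(text: str) -> dict:
    """'Probabilities of words following each other', as raw counts."""
    follows = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        follows[a].append(b)
    return follows

def roll_the_dice(follows: dict, seed: str, max_len: int = 12) -> str:
    """Seed an initial word, then sample each next word from the counts."""
    out = [seed]
    while len(out) < max_len and follows.get(out[-1]):
        out.append(random.choice(follows[out[-1]]))
    return " ".join(out)

follows = train_bigrams("Henry Kissinger was known to eat ice cream on a hot day")
print(roll_the_dice(follows, "Henry"))
# With a single training sentence there is only one possible roll, so the
# game reproduces its input verbatim -- the sparser the data, the less
# "purely by chance" an exact match becomes.
```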
Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?

Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?
TLDR:

"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""
The argument that the LLM is "learning" seems slightly flawed when you consider that other experts in the field describe it more as lossy compression. If lossy compression is what's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.
Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and OpenAI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA, which has comparatively little money, would seem to indicate that this is not about fair use per se, but simply about profit-seeking: the NYT is selecting targets with deep pockets like OAI/MS.

[1] https://theintercept.com/2023/09/17/new-york-times-website-internet-archive/

[2] https://fortune.com/2023/08/25/major-media-organizations-are-blocking-openai-bot-from-scraping-content/
This looks like a case of Media vs. Tech which might be decided by the courts using past paradigms, but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see it as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. On the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide; over on that side, things are relatively (but not completely) friendly. So my guess is Tech will end up winning, simply because it has more friends in the political realm than the Media does.
This wave is growing. I just cannot see how the big LLM players are going to get round this without paying big licence fees to content creators. Feels a bit like the torrent-to-Spotify moment, but for *all* content, not just music. How they will manage the licensing model is beyond me: it's going to be very easy for someone to sue these companies, but very difficult for the companies to calculate, attribute value, and pay out individual creators who contributed a tiny fraction of the training data. Surely this will make it very difficult for them to keep a business model working at a level their VC backers need to warrant even a fraction of their valuations.
In my head I like to think of web-crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are OK if they just provide snippets with citations (URLs); they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.

A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
Not that it would solve this, but how hard would it be for ChatGPT or other products to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM?
It could be a best-of-both-worlds situation if LLMs cited sources and linked to the sources themselves.
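Retrieval-augmented setups already approximate this: the citations come from a search step rather than from the weights. A rough sketch of the flow, where `search` and `complete` are hypothetical stand-ins for whatever search API and LLM endpoint an implementation would actually use:

```python
def answer_with_citations(question: str, search, complete) -> str:
    # Retrieve a handful of documents relevant to the question.
    docs = search(question, top_k=3)  # -> [(url, snippet), ...]
    context = "\n\n".join(
        f"[{i + 1}] {snippet}" for i, (url, snippet) in enumerate(docs))
    # Ask the model to answer from the numbered sources only.
    answer = complete(
        "Answer using only the numbered sources below, citing them as [n].\n\n"
        f"{context}\n\nQuestion: {question}")
    # Attach the links so the reader can follow through to the source.
    sources = "\n".join(f"[{i + 1}] {url}" for i, (url, _) in enumerate(docs))
    return f"{answer}\n\nSources:\n{sources}"
```

Of course, this only cites what was retrieved at answer time; it says nothing about what went into training.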
Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, the orgs quickly reversed course as their traffic plummeted.
I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research, but I think Anthropic also does this.

After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook follow Apple's lead. Off topic, but: even finding the smaller "open" models much less capable, they capture my imagination and my personal research time.
I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn't necessarily mean it's something we should disallow.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.

It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity's intellectual output. But for your own training data, you can use everything *except* for their product, conveniently for them.
Under existing conditions, an AI news site seems like a good investment idea. Its AI could read all relevant news sources, then retell and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.
This, or a lawsuit like it, is going to be the SCO vs. IBM of the 2020s, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to promote their own agendas (ironically, it was Microsoft that played that role with SCO). It's funny how times have changed; at least now a louder group seems to be on the troll's side. I hope to see some better analysis of the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally, using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.

edit: I'm speaking about training broadly capable foundation models like GPT-n. It would of course be possible to build a model that only parrots copyrighted content, and it would be hard to argue that is fair use.
Microsoft is one of the companies that loves to use copyright to get its way - the BSA is a known software mafia - so I'm not at all sympathetic to them.
What if you were one of the people who reads the Times from cover to cover every day and seriously tries to remember as much as possible, because you consider it a trustworthy reference source?

And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?

What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?

And what if that was your only lifetime source of information whatsoever, for some reason?
Huh, is this a big misunderstanding?

The Copilot screenshot they gave in the Ars Technica article, as well as many of the screenshots in the NYT complaint, seem to actually show correct behavior for browsing the web.

In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.

There are only one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can e.g. say 403 Forbidden whenever it likes, so if it returns a 200 OK, what's a user agent to do other than take it at its word?

The only twist is that this user agent is now Imbued With AI (tm)(r)(c). I don't think that really makes a difference here. If that's all this is, then it's more related to the legal fights over certain ad blockers or readability tools, which have similar functionality.

* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf , e.g. page 45; I mean, it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?
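The user-agent logic in question is just HTTP status-code handling; a minimal sketch in Python with the requests library (the agent string is a made-up placeholder):

```python
import requests

def fetch_as_user_agent(url: str):
    """The agent asks politely; the server decides. A 200 OK means the
    server handed over the content, and the agent may display it however
    it sees fit; anything else (403 Forbidden, 401, ...) is taken at its
    word and nothing is shown."""
    resp = requests.get(url, headers={"User-Agent": "example-agent/0.1"})
    if resp.status_code == 200:
        return resp.text
    return None  # the server declined; respect that
```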
Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:

> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.

> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”
Summarizing the article: the most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent that performs a search, visits pages, and returns the best results.

In doing this, it is bypassing the NYT paywall, and you can read full articles from today by repeatedly asking for the next paragraph.
Let's try the "reverse the gender" card.<p>Let's say OpenAI was trained on all the Windows source code (without approval from MS).<p>GPT could pretty much replicate the windows code with even not that clever prompt by any user. "Write an OS CreateProcess function like Windows 10 source code would have."<p>It would infuriate MS to put it mildly, enough to start a lawsuit.<p>I know the license to the MS source code and NYT articles aren't the same.
Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers, but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly, the robots can write like they're smug NYT reporters now - better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl [0]. Wasn't GPT-4 trained well before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-terms-of-service-to-prevent-ai-scraping-its-content/
Sad to say, but I would sooner believe a hallucination from OpenAI than anything that comes out of the NY Times. I mean, the confidence interval for the NY Times is what, again?
I think this could be a shakedown. They want money/licensing from OpenAI, the way Apple was recently offering news companies deals. High probability this gets settled out of court.
I asked an LLM to summarize the 69-page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)

Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:

The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.

The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.

The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.

OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.

The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.

Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.

The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.

In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.
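For anyone who wants to reproduce this, the mechanics are simple map-reduce summarization over chunks. A sketch of one way to do it with the OpenAI Python client (model choice, prompts, and chunk size are all illustrative, not what was actually run):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative; any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize(text: str, chunk_words: int = 3000) -> str:
    """Map: summarize each chunk of the document. Reduce: combine."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [ask("Summarize the key points:\n\n" + c) for c in chunks]
    return ask("Combine these notes into one summary:\n\n" + "\n\n".join(partials))
```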
Worth noting that - at least in the screenshot - this shows browsing functionality being used to get around paywalls, not that the model itself was trained on, or can really reproduce, the articles.

IIRC this was the reason the browsing plugin was disabled for some time after its introduction - they were patching up this hole.
> All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.

Not to be pedantic, but the NYT has the *least* robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but the NYT could try installing a lock that isn't made of confetti and uncooked pasta.
It's interesting to me, the ambiguous attitude people have toward reproducing news content. Whenever there is a story from the NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.

And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

I think that's something worth reflecting on: why do we feel it's OK to pirate news articles, but not other IP?

And the reason I bring this up is that OpenAI seems to have the same attitude: scraping news articles is OK, or at worst a gray area. But what if they were also scraping, for example, Netflix content to use as part of their training set?
Won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question, and he answers verbatim from copyrighted materials (due to photographic memory), word for word, back to you. Now if you take this answer and write a book, or publish it en masse on blogs for example, then you are the one who should be sued by the NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.
I don't think the lawsuit has any merit, but I'd still like to encourage Sam Altman et al., if they really care about the greater good, to go Keyser Söze and immediately release torrents of the weights and source code for GPT-4 under the GPL.
It's obviously a frivolous suit that will at best net a ceremonial victory for the NYTimes: an 8-figure max payout and a promise not to use NYTimes material in the future.

The gap in trajectory and value to society between OpenAI and the NYTimes could not be greater. The Times has won no favors in the court of public opinion with its frequent misinformation. It's all just a big waste of time; the last of the old guard flailing against the march of progress.

And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.
I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG, literally murder.
Should be: "NY Times wants OpenÄI to delete all GPT instances". You wouldn't want the hapless rabble misreading it as an "aiii" diphthong.