If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?

Let's say I host these, offer some indexing on it, and rewrite articles. Something like: summarise all articles on US-UK relationships over the past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe just quite short snippets.

Is that fair use? IANAL, but it doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say, like VS Code.

A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.

But then what do I know...
The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use".
However, OAI/MS should be able to fix this within the current paradigm: just learn to recognize and punish plagiarism via RLHF.

That said, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."

This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: what about a model can (or can't) be "transformative"?
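To make "punish plagiarism via RLHF" concrete: one hypothetical reward-shaping term would penalize n-gram overlap with the protected corpus. A minimal sketch; the function name, the choice of n, and the scaling are all mine, not any real pipeline's:

```python
def ngram_overlap_penalty(output: str, corpus_ngrams: set, n: int = 8) -> float:
    """Fraction of n-word sequences in `output` that appear verbatim in the
    protected corpus. Hypothetical: subtract this (scaled) from the RLHF
    reward to make verbatim recitation score badly."""
    words = output.split()
    if len(words) < n:
        return 0.0
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return sum(g in corpus_ngrams for g in grams) / len(grams)

# corpus_ngrams would be built once, offline, from the copyrighted subset
# of the training data: the set of all n-word tuples it contains.
```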
The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won't matter, for *so* many reasons. To enumerate a few:

1. Next gen LLMs will be trained exclusively on "synthetic"/public data. GPT-4V can *easily* whitewash its entire copyrighted training corpus to be unrecognizably distinct (say, reworded by 40%, authors/sources stripped, etc.). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide, all to appease rent-seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill-equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al. are coming along for the ride, whether they like it or not.
The lawsuit itself (which Ars Technica links to):

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

Page 30 onwards has some fairly clear examples of how ChatGPT holds an (internal) copy of copyrighted material which it will recite verbatim.

Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.

As I see it, the closedness of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
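That "straightforward matching mechanism" could be as simple as an n-gram index over the protected subset of the training data; a rough sketch, where the trigger length N is an arbitrary choice of mine:

```python
N = 12  # verbatim word-run length that trips the filter; arbitrary choice

def build_index(copyrighted_docs) -> set:
    """Index every N-word sequence occurring in the protected documents."""
    index = set()
    for doc in copyrighted_docs:
        words = doc.split()
        index.update(tuple(words[i:i + N]) for i in range(len(words) - N + 1))
    return index

def should_block(candidate: str, index: set) -> bool:
    """True if the model output contains any N-word run found verbatim in
    the protected corpus -- i.e., the web interface should refuse or
    redact it before display."""
    words = candidate.split()
    return any(tuple(words[i:i + N]) in index
               for i in range(len(words) - N + 1))
```

This is exactly the kind of fix only a closed, hosted service can bolt on at the interface; it does nothing about what the weights themselves contain.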
People who think the examples in the lawsuit are "fair use" need to consider what that would mean. We're basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules ... that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it's not this case.
We developers like to pretend that LLMs are akin to humans, and that they've been using things like the NYTimes as educational material, the way humans do.

But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code into my own codebase, then claimed copy-paste is a natural process evolved over millions of years.

The fact that LLMs are so complicated, and we can't point to where the copied material lives, doesn't make it any less so.
Companies that have content all see dollar signs.

NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.

They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
NYT's perspective is going to look so stupid in the future, when we put LLMs into mechanical bodies with the ability to interact with the physical world and to learn/update their weights live. A win here would make it completely illegal for such a robot to read/watch/listen to any copyrighted material: no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.
I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.
this was predicted in the very influential epic 2014 video in 02004

https://www.youtube.com/watch?v=eUHBPuHS-7s (the original is flash and has thus been consigned to the memory hole, so we are left with this poor-quality conversion)

36": 'however, the press as you know it has ceased to exist'

40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'

2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'

5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'

5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'

they didn't get the details exactly right, but overall the accuracy is astounding

however, that may be a hyperstition artifact in this timeline

https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)
> Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.

This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.
> “The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”

https://www.newyorker.com/books/page-turner/rethinking-the-luddites-in-the-age-of-ai
I see few people here bring this up, so let me:

The US constitution says the Congress shall have Power

> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;

So Congress's power to make copyright and patent laws is predicated on the promotion of science and useful arts (I believe the latter actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress by copyright is, in my mind, unconstitutional.
If I create a news website where I write articles in the following way:

- Read 20 different news websites and their story on the same event/topic

- Wait an hour, grab a cup of coffee

- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head

- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites

- I will probably also write something that is not correct, or add something new, because, as I said, my memory is not the best

Is that fair use? Am I infringing on copyright?
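That 10-word rule is mechanically checkable, incidentally; a sketch using Python's difflib (the threshold is the one from the comment, everything else is illustrative):

```python
from difflib import SequenceMatcher

def longest_shared_run(mine: str, source: str) -> int:
    """Length, in words, of the longest verbatim word sequence shared
    between my article and one source article."""
    a, b = mine.split(), source.split()
    m = SequenceMatcher(None, a, b, autojunk=False)
    return m.find_longest_match(0, len(a), 0, len(b)).size

# The self-imposed rule, checked mechanically against every source:
# ok = all(longest_shared_run(my_article, s) <= 10 for s in sources)
```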
I think LLMs may really change the IP landscape.

Culturally we're taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it's not unreasonable to re-examine them from first principles.
Would be funny if the NY Times won this and all commercial LLMs were shut down.

Then LLMs would be distributed only via torrents, like most copyright-infringing media.
At some point, carrying 100-year-old copyright/patent law will become so onerous a burden on the pace of progress that its enforcement will be antihuman.
Fair use is something Wikipedians dance around a fair amount. It also meant I did a *lot* of reading about it.

It's a four-part test. Let's examine it thusly:

1. Transformative. Is it? It spits out informative text and opinion. The only "transformation" is that it's generative text. IMO that's a fail.

2. Nature of the work - it's being used commercially. Given it's being trained partially on editorial, that's creative enough that I think any judge would find it problematic. Fail on this criterion.

3. Amount. It looks like they trained the model on all of the NYT's articles. Oops, definite fail.

4. Effect on the market. Almost certainly negative for the NYT.

IMO, OpenAI cannot successfully claim fair use.
I read about this in the Times today (and am surprised that it wasn't on HN already).

My guess is that the court will likely find in the Times' favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.

I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions, is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.

I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.
Given that Harvard's President plagiarized her way into becoming President, how can we be sure that the NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?
Related. Others?

*NYT sues OpenAI, Microsoft over 'millions of articles' used to train ChatGPT* - https://news.ycombinator.com/item?id=38784194 - Dec 2023 (80 comments)

*The New York Times is suing OpenAI and Microsoft for copyright infringement* - https://news.ycombinator.com/item?id=38781941 - Dec 2023 (837 comments)

*The Times Sues OpenAI and Microsoft Over A.I.'s Use of Copyrighted Work* - https://news.ycombinator.com/item?id=38781863 - Dec 2023 (11 comments)
> To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.

If the AI can recall the text verbatim then it's not at all the same. When we read, we are not able to reproduce the book from memory. Even if a human could memorise an entire book, it's not at all practical to reproduce the book that way. The current AIs are not learning "ideas"; they are learning sequences of words.
Probably has something to do with impending deals between the NYT and major companies, e.g.

[0] https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html

[1] https://www.theverge.com/2023/12/22/24012730/apple-ai-models-news-publishers
Something I have wondered about LLMs and training data: the biggest content producers on the internet now have their world view and tone echoed disproportionately in the next big wave of technology. That is incredibly impactful (although admittedly I don't know how to turn it into a profit). Is there some unforeseen long-term consequence of removing the New York Times from training data, such that it won't be part of LLM corpora going forward?
If they don't let AIs be trained on as much data as possible, those AIs will be less "good" than ones trained without such constraints, as you will have in China or elsewhere, and people will mechanically start using the latter.

Unless they engage in massive IP and DNS banning, geolocation-based, forced upon all internet users and "external" users.
NYT wants to outlaw a math game created by calculating the probabilities of word groupings and of words following each other in NYT articles, along with a lot of other writings the NYT does not own. The players roll the dice, so to speak, by seeding an initial string of words, and whoever comes up with the most interesting paragraph wins. This paragraph may or may not look like NYT writing, which, in the larger scheme of the collected writings of humankind, isn't particularly unique. It doesn't even have to be true. Hallucinations are an expected outcome.

If an NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT, because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique or exclusive to the NYT.

Is the NYT claiming ownership of the weights in LLMs?
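The "math game" framing can be made literal with a toy bigram model (real LLMs are vastly more complex, but the dice-rolling is the same in spirit):

```python
import random
from collections import defaultdict

def train_bigrams(text: str) -> dict:
    """'Probabilities of words following each other', as raw counts."""
    follows = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        follows[a].append(b)
    return follows

def roll_the_dice(follows: dict, seed: str, max_len: int = 12) -> str:
    """Seed an initial word, then sample each next word from the counts."""
    out = [seed]
    while len(out) < max_len and follows.get(out[-1]):
        out.append(random.choice(follows[out[-1]]))
    return " ".join(out)

follows = train_bigrams("Henry Kissinger was known to eat ice cream on a hot day")
print(roll_the_dice(follows, "Henry"))
# With a single training sentence there is only one possible roll, so the
# game reproduces its input verbatim -- the sparser the data, the less
# "purely by chance" an exact match becomes.
```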
Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?

Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?
TLDR:

"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""
The argument that the LLM is "learning" seems slightly flawed when you consider that other experts in the field describe it more as lossy compression. If lossy compression is what's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.
Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and OpenAI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA, which has comparatively little money, would seem to indicate that this is not about fair use per se, but simply about profit-seeking: the NYT is selecting targets with deep pockets like OAI/MS.

[1] https://theintercept.com/2023/09/17/new-york-times-website-internet-archive/

[2] https://fortune.com/2023/08/25/major-media-organizations-are-blocking-openai-bot-from-scraping-content/
This looks like a case of Media vs. Tech which might be decided by the courts using past paradigms, but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see it as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. On the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide; over on that side, things are relatively (but not completely) friendly. So my guess is Tech will end up winning, simply because it has more friends in the political realm than the Media does.
This wave is growing. I just cannot see how the big LLM players are going to get round this without paying big licence fees to content creators. Feels a bit like the torrent-to-Spotify moment, but for *all* content, not just music. How they will manage the licensing model is beyond me: it's going to be very easy for someone to sue these companies, but very difficult for the companies to calculate, attribute value, and pay out individual creators who contributed a tiny fraction of the training data. Surely this will make it very difficult for them to keep a business model working at a level their VC backers need to warrant even a fraction of their valuations.
In my head I like to think of web-crawler search engines/search engine databases and LLMs as being somewhat similar. Search engines are OK if they just provide snippets with citations (URLs); they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.

A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
Not that it would solve this, but how hard would it be for ChatGPT or other products to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM?
It could be a best-of-both-worlds situation if LLMs cited sources and linked to the sources themselves.
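Retrieval-augmented setups already approximate this: the citations come from a search step rather than from the weights. A rough sketch of the flow, where `search` and `complete` are hypothetical stand-ins for whatever search API and LLM endpoint an implementation would actually use:

```python
def answer_with_citations(question: str, search, complete) -> str:
    # Retrieve a handful of documents relevant to the question.
    docs = search(question, top_k=3)  # -> [(url, snippet), ...]
    context = "\n\n".join(
        f"[{i + 1}] {snippet}" for i, (url, snippet) in enumerate(docs))
    # Ask the model to answer from the numbered sources only.
    answer = complete(
        "Answer using only the numbered sources below, citing them as [n].\n\n"
        f"{context}\n\nQuestion: {question}")
    # Attach the links so the reader can follow through to the source.
    sources = "\n".join(f"[{i + 1}] {url}" for i, (url, _) in enumerate(docs))
    return f"{answer}\n\nSources:\n{sources}"
```

Of course, this only cites what was retrieved at answer time; it says nothing about what went into training.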
Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, the orgs quickly reversed course as their traffic plummeted.
I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research, but I think Anthropic also does this.

After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook follow Apple's lead. Off topic, but: even finding the smaller "open" models much less capable, they capture my imagination and my personal research time.
I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn't necessarily mean it's something we should disallow.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.

It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity's intellectual output. But for your own training data, you can use everything *except* for their product, conveniently for them.
Under existing conditions, an AI news site seems like a good investment idea. Its AI could read all relevant news sources, then retell and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.
This, or a lawsuit like it, is going to be the SCO vs. IBM of the 2020s, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to promote their own agendas (ironically, it was Microsoft that played that role with SCO). It's funny how times have changed; at least now a louder group seems to be on the troll's side. I hope to see some better analysis of the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally, using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.

edit: I'm speaking about training broadly capable foundation models like GPT-n. It would of course be possible to build a model that only parrots copyrighted content, and it would be hard to argue that is fair use.
Microsoft is one of the companies that loves to use copyright to get its way - the BSA is a known software mafia - so I'm not at all sympathetic to them.
What if you were one of the people who reads the Times from cover to cover every day and seriously tries to remember as much as possible, because you consider it a trustworthy reference source?

And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?

What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?

And what if that was your only lifetime source of information whatsoever, for some reason?
Huh, is this a big misunderstanding?

The Copilot screenshot they gave in the Ars Technica article, as well as many of the screenshots in the NYT complaint, seem to actually show correct behavior for browsing the web.

In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.

There are only one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can e.g. say 403 Forbidden whenever it likes, so if it returns a 200 OK, what's a user agent to do other than take it at its word?

The only twist is that this user agent is now Imbued With AI (tm)(r)(c). I don't think that really makes a difference here. If that's all this is, then it's more related to the legal fights over certain ad blockers or readability tools, which have similar functionality.

* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf , e.g. page 45; I mean, it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?
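The user-agent logic in question is just HTTP status-code handling; a minimal sketch in Python with the requests library (the agent string is a made-up placeholder):

```python
import requests

def fetch_as_user_agent(url: str):
    """The agent asks politely; the server decides. A 200 OK means the
    server handed over the content, and the agent may display it however
    it sees fit; anything else (403 Forbidden, 401, ...) is taken at its
    word and nothing is shown."""
    resp = requests.get(url, headers={"User-Agent": "example-agent/0.1"})
    if resp.status_code == 200:
        return resp.text
    return None  # the server declined; respect that
```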
Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:

> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.

> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”
Summarizing the article: the most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent that performs a search, visits pages, and returns the best results.

In doing this, it is bypassing the NYT paywall, and you can read full articles from today by repeatedly asking for the next paragraph.
Let's try the "reverse the gender" card.<p>Let's say OpenAI was trained on all the Windows source code (without approval from MS).<p>GPT could pretty much replicate the windows code with even not that clever prompt by any user. "Write an OS CreateProcess function like Windows 10 source code would have."<p>It would infuriate MS to put it mildly, enough to start a lawsuit.<p>I know the license to the MS source code and NYT articles aren't the same.
Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers, but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly, the robots can write like they're smug NYT reporters now - better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl [0]. Wasn't GPT-4 trained well before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-terms-of-service-to-prevent-ai-scraping-its-content/
Sad to say, but I would sooner believe a hallucination from OpenAI than anything that comes out of the NY Times. I mean, the confidence interval for the NY Times is what, again?
I think this could be a shakedown. They want money/licensing from OpenAI, the way Apple was recently offering news companies deals. High probability this gets settled out of court.
I asked an LLM to summarize the 69-page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)

Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:

The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.

The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.

The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.

OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.

The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.

Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.

The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.

In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.
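For anyone who wants to reproduce this, the mechanics are simple map-reduce summarization over chunks. A sketch of one way to do it with the OpenAI Python client (model choice, prompts, and chunk size are all illustrative, not what was actually run):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative; any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize(text: str, chunk_words: int = 3000) -> str:
    """Map: summarize each chunk of the document. Reduce: combine."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    partials = [ask("Summarize the key points:\n\n" + c) for c in chunks]
    return ask("Combine these notes into one summary:\n\n" + "\n\n".join(partials))
```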
Worth noting that - at least in the screenshot - this shows browsing functionality being used to get around paywalls, not that the model itself was trained on, or can really reproduce, the articles.

IIRC this was the reason the browsing plugin was disabled for some time after its introduction - they were patching up this hole.
> All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.

Not to be pedantic, but the NYT has the *least* robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but the NYT could try installing a lock that isn't made of confetti and uncooked pasta.
It's interesting to me, the ambiguous attitude people have toward reproducing news content. Whenever there is a story from the NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.

And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is very much _not_ the norm for the top-rated comment to be a Pirate Bay link.

I think that's something worth reflecting on: why do we feel it's OK to pirate news articles, but not other IP?

And the reason I bring this up is that OpenAI seems to have the same attitude: scraping news articles is OK, or at worst a gray area. But what if they were also scraping, for example, Netflix content to use as part of their training set?
Won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question, and he answers verbatim from copyrighted materials (due to photographic memory), word for word, back to you. Now if you take this answer and write a book, or publish it en masse on blogs for example, then you are the one who should be sued by the NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.
I don't think the lawsuit has any merit, but I'd still like to encourage Sam Altman et al., if they really care about the greater good, to go Keyser Söze and immediately release torrents of the weights and source code for GPT-4 under the GPL.
It's obviously a frivolous suit that will at best net a ceremonial victory for the NYTimes: an 8-figure max payout and a promise not to use NYTimes material in the future.

The gap in trajectory and value to society between OpenAI and the NYTimes could not be greater. The Times has won no favors in the court of public opinion with its frequent misinformation. It's all just a big waste of time; the last of the old guard flailing against the march of progress.

And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.
I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG, literally murder.
Should be: "NY Times wants OpenÄI to delete all GPT instances". You wouldn't want the hapless rabble misreading it as an "aiii" diphthong.