
The New York Times is suing OpenAI and Microsoft for copyright infringement

593 points by ssgodderidge, over 1 year ago

83 comments

kbos87, over 1 year ago
Solidly rooting for NYT on this - it’s felt like many creative organizations have been asleep at the wheel while their lunch gets eaten for a second time (the first being at the birth of modern search engines).

I don’t necessarily fault OpenAI’s decision to initially train their models without entering into licensing agreements - they probably wouldn’t exist, and the generative AI revolution may never have happened, if they had put the horse before the cart. I do think they should quickly course-correct at this point and accept that they clearly owe something to the creators of the content they are consuming. If they don’t, they are setting themselves up for a bigger loss down the road and leaving the door open for a more established competitor (Google) to do it the right way.
DamnInteresting, over 1 year ago
I have deeply mixed feelings about the way LLMs slurp up copyrighted content and regurgitate it as something "new." As a software developer who has dabbled in machine learning, it is exciting to see the field progress. But I am also an author with a large catalog of writings, and my work has been captured by at least one LLM (according to a tool that can allegedly detect these things).

Overall, current LLMs remind me of those bottom-feeder websites that do no original research - those sites that just find an article they like, lazily rewrite it, introduce a few errors, then maybe paste in some baloney "sources" (which always seem to exclude the actual original source). That mode of operation tends to be technically legal, but it's parasitic and lazy and doesn't add much value to the world.

All that aside, I tend to agree with the hypothesis that LLMs are a fad that will mostly pass. For professionals, it is really hard to get past hallucinations and the lack of citations. Imagine being a perpetual fact-checker for a very unreliable author. And laymen will probably mostly use LLMs to generate low-effort content for SEO, which will inevitably degrade the quality of those same LLMs as they breed with their own offspring. "Regression to mediocrity," as Galton put it.
solardev, over 1 year ago
I hope this results in Fair Use being expanded to cover AI training. This is way more important to humanity's future than any single media outlet. If the NYT goes under, a dozen similar outlets can replace them overnight. If we lose AI to stupid IP battles in its infancy, we end up handicapping probably the single most important development in human history just to protect some ancient newspaper. Then another country is going to do it anyway, and the NYT is still going to get eaten.
Aurornis, over 1 year ago
The arguments about being able to mimic the New York Times "style" are weak, but the fact that they got it to emit verbatim NY Times content seems bad for OpenAI:

> As outlined in the lawsuit, the Times alleges OpenAI and Microsoft's large language models (LLMs), which power ChatGPT and Copilot, "can generate output that recites Times content verbatim
wg0, over 1 year ago
Google can look into their index and remove whatever they want within minutes. But how can that be possible for an LLM? That is, how do you "decontaminate" the model from certain parts of the corpus? I can only think of excluding the data from the training set and then retraining.

As a side note, I think the LLM frenzy will be dead in a few years, ten years at most. Rent-seeking on these LLMs, as practiced today, will no longer be a viable or as profitable a business model once more inference circuitry gets out into the wild in laptops and phones, and more models get released and tweaked by the community.

People thinking to downvote and dismiss this should look at the history of commercial Unix and how that turned out: today almost no workload (other than CAD and graphics) runs on Windows or commercial Unix, and I highly doubt this very forum is hosted on Windows or a commercial variant of Unix.
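The only remedy the comment above identifies - exclude the data and retrain - boils down to filtering the training corpus before a fresh run, since trained weights cannot simply "forget" individual documents. A toy sketch (the record fields here are invented for illustration):

```python
def decontaminate(corpus, blocked_sources):
    """Drop every document whose source is blocked.

    This only cleans the *corpus*; an already-trained model would still
    have to be retrained from scratch on the filtered data.
    """
    return [doc for doc in corpus if doc["source"] not in blocked_sources]

# Hypothetical corpus records, keyed by origin domain.
corpus = [
    {"source": "nytimes.com", "text": "an article..."},
    {"source": "example-blog.net", "text": "a blog post..."},
    {"source": "nytimes.com", "text": "another article..."},
]

clean = decontaminate(corpus, {"nytimes.com"})
print(len(clean))  # 1
```

The expensive part is not this filter - it is the full retraining pass that must follow every successful takedown, which is exactly the cost the comment is pointing at.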
blagie, over 1 year ago
I think the train has left the station and the ship has sailed. I'm not sure it's possible to put this genie back in the bottle. I had stuff stolen by OpenAI too, and I felt bad about it (and even sent them a nasty legal letter when it could output my creative work almost verbatim), but I think at this point the legal landscape needs to somehow adjust. The Copyright Clause in the US Constitution is clear:

*To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries*

Blocking LLMs on the basis of copyright infringement does NOT promote progress in science and the useful arts. I don't think copyright is a useful basis for blocking LLMs.

They do need to be regulated, and quickly, but that regulatory regime should be something other than copyright. The concept of OpenAI before it became a for-profit frankenmonster was good. Private failed, and we now need public.
NiloCK, over 1 year ago
Interesting.

I think the appropriation, privatization, and monetization of "all human output" by a single (corporate) entity is at least shameless, probably wrong, and maybe outright disgraceful.

But I think OpenAI (or another similar entity) will succeed via the Sackler defense: OpenAI has too many victims for litigation to be feasible for the courts, so the courts must preemptively decide not to bother with compensating those victims.
batch12, over 1 year ago
> The New York Times is suing OpenAI and Microsoft over claims the companies built their AI models by "copying and using millions" of the publication's articles and now "directly compete" with the outlet's content.

Millions? Damn, they can churn out some content. 13 million [0]!

[0] https://archive.nytimes.com/www.nytimes.com/ref/membercenter/nytarchive.html#:~:text=1851%E2%80%93PRESENT,than%2013%20million%20articles%20total
andy99, over 1 year ago
The way to view this kind of parasitism is how we look at patent trolls. With the RIAA/MPAA lawsuits, while I don't agree with them, at least file sharing was basically a canonical form of copyright infringement.

With LLMs, we have an aspect of a text corpus that the creators were not using (the language patterns), had no plans for, and had no idea could even be used. Then, when someone comes along and uses it - not to reproduce anything, but to provide minute iterative feedback in training - they run in to try to extract some money. It's parasitism. It doesn't benefit society, it only benefits the troll; there is no reason courts should enforce it.

Someone should try to show that a NYT article can be generated autoregressively and argue it's therefore not copyrightable.
pm90, over 1 year ago
NYT article with a lot more context: https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
dissident_coder, over 1 year ago
Even if they win against OpenAI, how would this prevent something like a Chinese or Russian LLM from "stealing" their content and making their own superior LLM that isn't weakened by regulation like the ones in the United States?

And I say this as someone who is extremely bothered by how easily mass amounts of open content can be vacuumed up into a training set with reckless abandon. There isn't much you can do other than put everything you create behind some kind of authentication wall, but even then it's only a matter of time until it leaks anyway.

Pandora's box is really open. We need to figure out how to live in a world with these systems, because it's an unwinnable arms race in which only bad actors benefit from everyone else being neutered by regulation - *especially* given the massive pace of open-source innovation in this space.

We're in a "mutually assured destruction" situation now, but instead of bombs the weapon is information.
breadwinner, over 1 year ago
Here's the most important part (from the NYT story on the lawsuit [1]):

In one example of how A.I. systems use The Times's material, the suit showed that Browse With Bing, a Microsoft search feature powered by ChatGPT, reproduced almost verbatim results from Wirecutter, The Times's product review site. The text results from Bing, however, did not link to the Wirecutter article, and they stripped away the referral links in the text that Wirecutter uses to generate commissions from sales based on its recommendations.

[1] https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
jmyeet, over 1 year ago
When the US entered WWI, it couldn't build a plane despite having invented them. It had to buy planes from the French. Why? The Wright brothers patent war [1]. This led Congress to create a patent pool for avionics that exists to this day.

Honestly, I get this feeling about these lawsuits over using content to train LLMs.

Think of it this way: in growing up, learning to read, and getting an education, you read any number of books, articles, Web pages, magazines, etc. You viewed any number of artworks, buildings, cars, vehicles, pieces of furniture, etc., many of which might have design patents. We have such silliness as it being illegal to distribute photos of the Eiffel Tower at night commercially [2].

What's the difference between training a model on text and images and educating a person with text and images, really? If I read too many NYT articles, am I going to get sued for using too much "training data"?

Currently we need copious quantities of training data for LLMs. I believe this is because we're in the early days of this tech - no person has read millions of articles or books. At some point models will get better with substantially smaller training sets. And then, how many articles is too many as far as these suits go?

[1]: https://en.wikipedia.org/wiki/Wright_brothers_patent_war

[2]: https://www.travelandleisure.com/photography/illegal-to-take-eiffel-tower-photos-at-night
soultrees, over 1 year ago
Here come the innovation sponges.

If this goes through, the models the general public has access to are going to be severely neutered, while the ownership class will have a much better model that never sees the light of day due to legal risks and claims like this - thereby increasing the disparity between us all.
pelorat, over 1 year ago
Time to move LLM training to Japan, which passed a law giving free rein to train LLMs on copyrighted material.
mritchie712, over 1 year ago
I'd bet they win, but how do you possibly measure the dollar amount? If you stripped out 100% of NYT content from GPT-4, I don't think you'd notice a difference. But if you go domain by domain and keep stripping training data, the model will eventually get worse.
whichfawkes, over 1 year ago
It seems weird to sue an AI company because their tool "can recite [copyrighted] content" verbatim.

If I paid a human to recite the whole front page of the New York Times to me, they could probably do it. There's nothing infringing about that. However, if I videotaped them reciting the front page of the New York Times and started selling that video, then *I* would be infringing on the copyright.

The guy I paid to tell me what the NYT was saying didn't do anything wrong. Whether there's any copyright infringement would depend on what I did with the output.
crowcroft, over 1 year ago
It will be interesting to see where this ends up.

If I scraped NYT content and then commercialized a service that lets users query that content through an API (occasionally returning verbatim extracts), without any agreement with or payment to the NYT, that would be illegal.

It's not obvious to me why putting an LLM in the middle of the process changes that.
mrweasel, over 1 year ago
> "copying and using millions" of the publication's articles and now "directly compete" with its content as a result.

The New York Times doesn't have a lot of faith in the quality of its own content. How on earth is ChatGPT going to go out into the world and report from Gaza or Ukraine? How is it going to go to the president's press conference and ask questions? ChatGPT cannot produce original content the way a newspaper can. The fact that the NYT seems to believe ChatGPT can compete says a lot about how they write their articles, or about their lack of understanding of how LLMs work.

Now, I do believe OpenAI could at least have asked the newspapers before just scraping their content, but I think they knew that doing so would have undermined their business model - which tells you something about how tech companies work.
JCM9, over 1 year ago
These lawsuits could end up being a nightmare for AI companies if the plaintiffs are successful. One can't easily remove content from a model the way one can from a website after a takedown notice. The content is deeply embedded in the mathematical relationships within the model. You'd basically have to retrain the whole model sans the offending data. Given the cost to retrain, just a few successful claims would destroy any business built around making money off these models.

The Times appears to have a strong case here, with its complaint showing long verbatim passages produced by ChatGPT that go far beyond any reasonable claim of fair use. This will be an interesting case to watch that could shape the whole generative AI space.
lp4vn, over 1 year ago
To me it's quite obvious that if you profit from an engine that takes copyrighted material as input, then you owe something to the owner of that copyrighted content. We have seen this same problem with artists claiming stable diffusion engines were using their art.
octacat, over 1 year ago
Someone should train an AI on decompiled Windows code (not encouraging it, but it would be interesting). Copyright is important to corpos when it protects their interests. Producing the exact text of NYT articles is pretty much a copyright violation. At least, the last time around, companies were blaming each other because their Java API implementations looked pretty similar.

Even with open-source code you cannot just remove the authors and the license, replace some functions, and say "oh, it's my code now." Only public-domain code would allow that. But with Copilot you could.
prepostertron, over 1 year ago
Surprised they don't mention Bard anywhere in the article. I wonder if the NYT has worked out some sort of licensing deal with Google for Bard, or if Bard isn't trained on NYT data.

The lawsuit mentions this, so maybe they did work out an agreement to license their data: "For months, The Times has attempted to reach a negotiated agreement with Defendants, in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products (including the news products developed by Google, Meta, and Apple)."
desireco42, over 1 year ago
In my view, this is just rent-seeking from dying media instead of working on creating something new.

AI is indeed reading and using the material as a source, but it derives results based on that material. I think this should be allowed, but now it is pretty much a fight over who has the better-paid politicians.

I am open to hearing other thoughts.
KHRZ, over 1 year ago
If I ask an LLM "repeat this sentence: [copyrighted sentence]", is that copyright infringement by the LLM - and by recorders such as cameras and parrot toys - or middle-schooler troll logic? Because apparently this is the argument they want to take on Microsoft Bing with.
JCM9, over 1 year ago
The challenge for all these AI companies is that the only thing of value for building a defensible commercial product is having proprietary datasets for training. With the underlying techniques and algorithms all being rapidly commoditized, the power lies with whoever holds and owns the data. As in all other ML "revolutions," it's the training data that matters, and if you don't have access to training data others don't have, you'll soon be toast.
halukakin, over 1 year ago
For many of our Google searches, the first results tend to be Wikipedia, Instagram, etc. We click on those links, and both Google and the clicked website get a share of our traffic. So it is somewhat fair.

But in the current AI situation, Wikipedia, the NYT, Stack Overflow, etc. are getting a pretty unfair deal. Probably all major text-based outlets are seeing a drop in their numbers now.
notfed, over 1 year ago
And here I am thinking it'd be amazing to have an AI that can read me, on demand, every novel ever written. It'd be even cooler to jump into a text-adventure game of any novel and have it actually follow the original text.

I guess that clashes with our copyright world. (Is there hope for some kind of Netflix/Spotify model, with fractional royalties?)
nojvek, over 1 year ago
> For example, in 2019, The Times published a Pulitzer-prize winning, five-part series on predatory lending in New York City's taxi industry. The 18-month investigation included 600 interviews, more than 100 records requests, large-scale data analysis, and the review of thousands of pages of internal bank records and other documents, and ultimately led to criminal probes and the enactment of new laws to prevent future abuse.

> OpenAI had no role in the creation of this content, yet with minimal prompting, will recite large portions of it verbatim.

This is the smoking gun. GPT-4 is a large model and hence highly likely to reproduce content. They have many such examples in the court filing: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

IANAL, but that's a slam dunk of a copyright violation.

NYT will likely win.

This is also why OpenAI should not go YOLO scaling up to GPT-5, which would likely recite even more copyrighted content. More parameters, more memorization.
andy99, over 1 year ago
When it comes to foundation models, I think there needs to be a distinction between potential and actual infringement. You can use a broadly trained foundation model to generate copyright-infringing content, just as you can use your brain to do so. But the fact that a model *can* generate such content doesn't mean it infringes by its mere existence. https://www.marble.onl/posts/general_technology_doesnt_violate_copyright.html
meroes, over 1 year ago
There are verbatim and almost-verbatim copyrighted works being outputted - lengthier than any fair use I've seen permitted - which they charge money for and don't have licensing for.

What am I missing?
lokar, over 1 year ago
When I was studying AI in grad school many years ago, getting good big datasets was always an issue. It never occurred to me to just copy one without permission.
Imnimo, over 1 year ago
This looks a lot more convincing to me than the Copilot lawsuit or the Sarah Silverman one. This suit shows ChatGPT reciting large amounts of NYT articles - not just little snippets of code or the ability to answer questions about Silverman's book.

It feels like even if training on copyrighted data is fair use (and I think it should be), that wouldn't give you a pass on regurgitating that training data to anyone who asks.
aomix, over 1 year ago
Is there a decent guess at how much of ChatGPT's training data is copyrighted work, subject to removal depending on a few court cases? GPT-4 is supposed to be an order of magnitude larger than the open-source models that use essentially everything that can be used without asking. So that whole extra magnitude?
mvcalder, over 1 year ago
Does anyone know the copyright status of LLM-generated content? That is, if I feed a NYT article into GPT-4, say "summarize this article," and then publish that summary, is there an argument or precedent that says that is, or is not, copyright infringement? Asking for a friend.
robg, over 1 year ago
Can someone explain the technical difference between what search engines do to index newspapers and what is being claimed here? Is the difference as simple as my being able to get summaries and content from a newspaper via GPT without needing to visit their website?
fennecbutt, over 1 year ago
Heh, people want to treat LLM learning differently from their own learning.

I think it's fine: as long as it was fed publicly accessible content, without any payment or subscription, then it's as accessible to an LLM as it is to you and me, and that's fair.

And for the people who screech about LLMs being different because they can mass-produce derivative works: first of all, ALL works are derivative, and if machine-produced works are compelling enough to compete with human-produced ones, then clearly humans need to get better at it.

The automatic loom took over from weavers because it was better; if it weren't, people would still work as weavers.
anigbrowl, over 1 year ago
LOL, fuck the NYT.

That's like suing someone who had a NYT subscription and read the paper daily for occasionally quoting a choice phrase verbatim. I've been quite critical of AI's impact on the livelihood of artists (whose economic position is precarious to start with, and who are now faced with replacement by machine-generated art), but at the same time I reject the copyright complaint completely. Transformers are very obviously doing something else, similar to how a human learns and recreates; the key difference is that they can do it at a scale unreachable by individuals.
edwintorok, over 1 year ago
They are not the only ones to sue. There is also a class action: https://githubcopilotlitigation.com/
6R1M0R4CL3, over 1 year ago
I don't know. The New York Times is keeping those pages online and accessible. If a human can go check those pages and take notes, the same human can write code that will go read those pages and produce notes, or data based on the contents. Call it AI if you like; it doesn't matter. The NYT has that content online and accessible, and on the internet there is no difference between a human grabbing that data and a machine.

If you put content on the internet, accessible to humans, why do you now want to tell people that if a machine does the same thing, suddenly you do not agree? I am free to write code, or design a machine, to go get that data and do whatever I want with it (as long as I don't do something illegal like stealing content under copyright).

And I don't give a F about the "terms of use" those morons put online, because those have NO value. There is either a contract signed by two parties, or there is not. Content you put on the internet, accessible to everyone who sends you a GET, is like writing stuff on a page and putting that page outside on the street.

We could use humans to go read all those pages and create new content from the knowledge gained on those various subjects. Machines are here to reproduce what humans can do, to free up our time for more interesting things. Those servers send the same data back from a GET whether the request comes from me, a human, or a machine. And those morons put that data there, accessible to all, so to now see them cry foul makes me laugh.
fennecbutt, over 1 year ago
I've just tried to get GPT-3.5 to spit out a NYT article verbatim in all sorts of ways, and it just can't.

It will compile an article that "looks like" the NYT's (or any other news site's), but none of the paragraphs were a match for any of their articles that I could find.

I'm really curious to see what evidence they have for the case beyond "it can claim to be the NYT and write an article composed of all sorts of bullshit from every corner of the Web."
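The "none of the paragraphs were a match" check above can be made mechanical by comparing word n-grams between a generated text and a reference article - a crude sketch, not how either party actually measured it, with made-up strings for illustration:

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of n-word sequences in `generated` that also appear in `reference`.

    Near 1.0 suggests verbatim copying; near 0.0 suggests a paraphrase.
    """
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen, ref = ngrams(generated), ngrams(reference)
    if not gen:
        return 0.0
    return len(gen & ref) / len(gen)

article = "the quick brown fox jumps over the lazy dog near the river bank today"
copied = "the quick brown fox jumps over the lazy dog near the river bank today"
paraphrase = "a fast brown fox leapt over a sleepy dog close to the river today"

print(ngram_overlap(copied, article))      # 1.0
print(ngram_overlap(paraphrase, article))  # 0.0
```

Long shared n-grams (the NYT complaint highlights multi-paragraph matches) are far beyond what independent writing would produce by chance, which is why this kind of measure distinguishes "looks like" from "is."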
make3, over 1 year ago
I hate that this will likely be decided by a 75-year-old judge who hasn't been close to a computer since getting his 15-year-old grandson to fix his patience game.
greatNespresso, over 1 year ago
I still believe there's a place for a marketplace that rewards creators and journalists for their content when it is used specifically for AI training. As part of exploring that idea with faie.io, I got in touch with an exec in the publishing industry to talk about this, and the desire was there. What felt sad to me was the publishers' lack of awareness of the existential threat that conversational search will pose to their business.
emrah, over 1 year ago
Verbatim usage of content is obviously copyright infringement, but speaking English is not, and learning from content is not copyright infringement either. I don't know if the NYT has a clause for this type of usage of its content, but I still don't think it would be covered by copyright as I understand it.

As long as an LLM rephrases what it learned and doesn't regurgitate verbatim text, it should be fine - but we'll see what the judge says.
reqo, over 1 year ago
I think in the long run it is in the interest of AI companies to incentivize creators to produce high-quality data. Not paying them their fair share will likely decrease the volume of high-quality data available (or make it much less accessible). Unless these companies have already developed another architecture that can learn much more from the same dataset, the lack of new high-quality data will be a problem for future, larger models.
dogman144, over 1 year ago
If meatspace's non-technical industries and social/civil organs get the wool pulled over their eyes again, they deserve whatever tech-induced industry chaos occurs: search/ad revenue models destroying journalism, social media destroying our civics and bonds, "move fast and break civil regs" (which have real people and their lives behind them) with Airbnb and rideshare, and now maybe LLMs and content ownership.
TekMol, over 1 year ago
The actual complaint is here:

https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf

They state that "with minimal prompting," ChatGPT will "recite large portions" of some of their articles with only small changes.

I wonder why they don't sue the Wayback Machine first. You can get whole articles on the Wayback Machine - not just portions, and not with small changes but verbatim - and you don't need any special prompting. As soon as you are confronted with a paywall window on the Times website, all you need to do is go to the Wayback Machine, paste the URL, and read it.
arxpoetica, over 1 year ago
We may well look back on these lawsuits and laugh.

AI will likely steamroll current copyright considerations. If we live in a world where anything can be generated at whim, copyright will seem less and less relevant, or even possible.

Wishful thinking, but maybe we'll all turn away from the obsession with ownership and instead turn to feeding the poor, clothing the naked, and visiting the sick and afflicted.
giardini, over 1 year ago
A ChatGPT that was literate in English to the level of Victorian England, scientifically literate from library books, and fed news from the Lincoln Journal Star (a Nebraska newspaper) would be more than sufficient for most of my needs.

Cut the NYT out of the loop - fu*' em! Let them sell their own damned GPT, and then charge them like crazy for the license.
ouraf, over 1 year ago
Too little, too late, though - hence the push for AI models trained on synthetic data. A model poisoned with copyrighted material can be tweaked to train its successor with the knowledge and meaning of a copyrighted work while avoiding paraphrasing or other easy giveaways of a copyrighted source.
ssully, over 1 year ago
I haven't seen anyone mention how Apple is exploring deals with news publishers, including the NYTimes, to train its LLMs [1].

[1]: https://www.nytimes.com/2023/12/22/technology/apple-ai-news-publishers.html
Racing0461, over 1 year ago
I feel like there is a way around this: use as many materials as possible (books, newspapers, crawled websites, etc.) to build an LLM that is good at reasoning and next-token generation, but have it answer questions only from reference knowledge files that the user uploads at the time of asking.
nikolay, over 1 year ago
Anyway, it's better for OpenAI not to be trained on biased media such as the NYT and Fox News!
logicchains超过 1 年前
What next, suing the school system for using NYT articles in English class to train children?
mcculley, over 1 year ago
A minimum requirement for LLMs should be documentation of the corpora they were trained on.
telotortium, over 1 year ago
Complaint: https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf
lupusreal, over 1 year ago
Would it even be possible for OpenAI to excise the NYTimes data from their models without rerunning all the training? It seems like a huge mess, particularly since they'd have to do it each time they lose a lawsuit.
wdr1, over 1 year ago
Sad to see, but not surprising.

In 2011, Google found that Microsoft was basically copying Google's results. (It's actually an interesting story of how Google proved it; search for "hiybbprqag".)
tim333, over 1 year ago
It's going to be hard, legally, to distinguish between a human reading the Times and answering questions and an LLM doing the same. It'll be interesting to see how it plays out in court.
Sateeshm, over 1 year ago
I posted this a few months ago: https://news.ycombinator.com/item?id=34381399

Piracy at scale.
mensetmanusman, over 1 year ago
If copyright lasted 20 years like patents, it would be reasonable for AI companies to wait. 100 years is not reasonable.
fallingknife, over 1 year ago
What are they arguing here? AFAIK, reading copyrighted works is not copyright infringement. Copying and selling them is, as the name would suggest, but OpenAI absolutely did not do that. Are they trying to say that LLM training is a special type of reading that should be considered infringement? Seems like a weak case to me.

Edit: it would be very funny if OpenAI used an educational fair-use defense.
pcurve, over 1 year ago
OpenAI may have had some leg to stand on before. But once you start monetizing, all bets are off.
PunchTornado, over 1 year ago
I hope Microsoft ends up paying billions and billions in damages and similar lawsuits will follow.
adolph, over 1 year ago
Is there something in their license that forbids the use of their content to train a model?
jakeinspace, over 1 year ago
I don't see a judge ruling that training a model on copyrighted works is infringement; I think (hope) that will be ruled protected fair use. It's the LLM's output behavior, specifically the model's willingness to reproduce verbatim text, that is clearly a violation of copyright and should rightfully result in royalties being paid out. It also seems like something that should be technically feasible to filter out or cite, but at a serious cost (both in compute and in latency for the user). Verbatim text should be easy to identify, although it may require a Google-Search-level amount of indexing and compute. As for summaries and text "in the style of" the NYT or others, that's the tricky part. I'm not sure there's any high-precision way to identify that on the output side of an LLM, though I can imagine a GAN trained to do so (erring on the side of false positives). Filtering out suspiciously infringe-ish outputs and re-running inference seems much more solvable than perfect citations for non-verbatim output.
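The verbatim check described here can be sketched cheaply (my own illustration, not OpenAI's actual filter): flag an output if it shares a sufficiently long word n-gram with an indexed source document.

```python
# Minimal verbatim-reproduction detector: compare word n-gram "shingles"
# between a model output and a source text.

def ngrams(text, n=8):
    """Set of all n-word runs in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_verbatim(output, source, n=8):
    """True if the output contains any n-word run copied from the source."""
    return bool(ngrams(output, n) & ngrams(source, n))

source = ("the quick brown fox jumps over the lazy dog "
          "while the cat watches from the windowsill nearby")
copied = "he wrote that the quick brown fox jumps over the lazy dog today"
fresh = "a summary of the article in entirely new words"
```

Here `is_verbatim(copied, source)` is true and `is_verbatim(fresh, source)` is false. Doing this against every indexed article at inference time is exactly the search-level indexing cost the comment anticipates, and it says nothing about close paraphrase.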
LeoPanthera, over 1 year ago
I honestly thought that OpenAI had simply paid for access to the news corpus. I'm actually on the side of OpenAI here: if you put something on the web, you can't get upset when people read it. Training a neural network is not functionally different from a human reading it and remembering it.

But if I were OpenAI, I would have tried to do a deal to pay them anyway. Having official access is surely easier than scraping the web, and the optics are much better.
mrobot, over 1 year ago
I'm wondering how private models will diverge from public ones, specifically for large "private" datasets like those of the NSA, but also for private personal use.

For the NSA and other agencies, I am guessing that, given the relative freedom from public oversight they enjoy, they will develop an unrestricted large model that is not worried about copyright. Can anyone think of why this might not be the case? It is interesting to think about the power dynamic between the users of such a model and the public. Also interesting to think about the benefits that simply being an employee of one of these agencies (or maybe just the government in general) would have on your personal experience in life. I recall articles elucidating that at the NSA there were few restrictions on employee usage of data, and there were/are many instances of employees abusing surveillance data in their personal lives. I guess, extended to this situation, that would mean lots of personal use of these large models with little oversight and tremendous benefit to being an employee.

I have also wondered, given just how bad search engines have gotten (a lot of it from AI-generated spam), about current non-AI discrepancies between the NSA and the public. Meaning: can I just get a better Google by working at the NSA? I would think maybe, because the requirements are different from those of an ad company. They have an actual incentive to build something resistant to SEO, outside of normal capitalist market requirements.

For personal users, I wonder if the lack of concern for copyright will be a feature / selling point for the personal-machine model. It seems, from something I read here, that companies like Apple may be diverging toward personal-use AI as part of their business model. I suppose you could build something useful that crawls public data without concern for copyright, strictly for personal use. Of course, the sheer resources in machine power and money power would not be there. I guess legislation could be written around this as well.

Thoughts?
Geisterde, over 1 year ago
Maybe the NYT should do more to protect its own content; it could go back to exclusively being a newspaper. They seem to understand that better than this whole internet funny business.
indigodaddy, over 1 year ago
Is the answer for LLMs/OpenAI to properly cite/give credit to the authoritative source? If they did that, would the NYT still have a claim/case? I'd still think yes, because the content is not publicly available (it's behind a paywall), so some sort of subscription/redistribution-of-content license would likely be appropriate. But after that license/agreement (which I assume they must already have something like in place, no?), if they cite/give credit to the source instead of rewording/summarizing/claiming it as their own, would that be enough to thwart legal challenges?
keiferski, over 1 year ago
Assuming that the OpenAI models were trained on NY Times articles (it's still unclear to me if they were directly, or if ChatGPT can just write an article "in the style of the NYT"): what I don't understand is, why run the risk of this situation? Did no one stop and think, "Hmm, maybe we should just use freely available text sources and not the paywalled articles of the wealthiest newspaper in the country?" Leaving the ethics of doing so aside, it just seems like an exceptionally poor tactical move.
dboreham, over 1 year ago
The sound of it all ending in tears.
dankle, over 1 year ago
Good. I hope they win.
devd00d, over 1 year ago
Honestly, I'm very surprised it took this long. It's been an elephant in the room for ages.
alex201, over 1 year ago
Such incidents mark the end of an era. The decline of traditional media's relevance in the digital age is well underway.

I feel sorry for those who feed their families through this industry, but they need to learn and adapt before it's too late.

Even if this lawsuit finds merit, it's akin to temporarily holding back a tsunami with a mere stick: a momentary reprieve, but not a sustainable solution.

I agree with those who say power matters. There are players out there who don't care about copyrights. They will win if the "good guys" fall into the trap of protecting old information models by limiting the potential of new tech.

Such events should be a clear signal: evolve or risk obsolescence.
6stringmerc, over 1 year ago
Excellent! I am all for this type of contested reality around sources and derivative works. If you feed something in, you can't claim it's not being used in a disallowed way when you can't explain how the fuck your little box works in the first place. I mean, seriously, pouring gasoline on yourself and playing with matches has about the same cause and effect as this input/output.
superduty, over 1 year ago
If models are trained on NYT content, the future of AI is horrific.
twelve40, over 1 year ago
Good. Finally someone with PR clout is calling out the massive theft that happened. During their AI land grab, "Open"AI conveniently ignored opt-in, opt-out, revshare, and other norms and civilized rules followed elsewhere.
falcor84, over 1 year ago
This is a wonderful holiday present. It's hard for me to imagine an outcome of this trial that I'd be against. Whoever loses (preferably both), it would be positive for society. It would even be great if the only outcome were that future LLMs are prohibited from using the NY Times's writing style.
collaborative, over 1 year ago
So the NYT wants its content to be indexed by search engines, and therefore makes it public to all crawlers, but then complains when some crawlers use the content to train AI on it? This issue is about the NYT wanting to lure internet users into its biased and politically motivated news website (and make them pay for it). If the NYT wants to, it can block crawlers and rely on loyal readers typing nytimes.com into the browser. Easy. But competing is hard. It was easier when they could just control shelf space in kiosks.
rand1239, over 1 year ago
You copyright content that you invented, which didn't exist before.

But NYT content is reporting events truthfully to the public, without any fiction or lies.

Since there can be only one truth, it should not matter whether the NYT, the Washington Post, or ChatGPT is spinning it out.

Unless the NYT is claiming they don't report the truth and publish fiction. That is of concern, since the NYT claims to report the news truthfully.

So is the NYT scamming Americans out of hundreds of millions of dollars in subscription fees by making a false promise about the things they report?

That should be the bigger question here.
ImPleadThe5th, over 1 year ago
If AI companies wanted to train their models on good content, they had a chance to create a second Renaissance: funding artist collectives to create content for their models, paying royalties to authors, generally increasing the value of human art while creating a new form of expression.

Instead, they do what every large corporation does and treat art like content. They are making loads of money off the backs of artists who are already underpaid and often undervalued, and they didn't have the decency to ask for permission.

I know publishers don't treat authors much better. But I see this as the NYT fighting for its journalists.
sschueller, over 1 year ago
If I were the CIA/US government, I would somehow want the NY Times to drop this case, as one would not want AIs that lack the talking points and propaganda pushed via the papers to become part of the record.

I am not saying that the NY Times is a CIA asset, but from the crap they have printed in the past, like the whole WMDs-in-Iraq saga and the puff piece on Elizabeth Holmes, they are far from a completely independent and propaganda-free paper. Henry Kissinger would call the paper and have his talking points printed the next day regarding Vietnam. [1]

There is a huge conflict between access to government officials and the independence of papers.

[1] https://youtu.be/kn8Ocz24V-0?si=kWyWXztWGjS_AJVl