科技回声 (Tech Echo)

Why training AI can't be IP theft

45 points | by OuterVale | about 1 month ago

17 comments

blagie | about 1 month ago

I asked AI to complete an AGPL code file I wrote a decade ago. It did a pretty good job. What came out wasn't 100% identical, but it was clearly a paraphrased copy of my original.

Even if we accept the house-of-cards of shaky arguments this essay is built on, even just for the sake of argument, where OpenAI breaks my copyright is by having a computer "memorize" my work. That's a form of copying.

If I've "learned" Harry Potter to the level where I can reproduce it verbatim, the reproduction would be a copyright violation. If I can paraphrase it, ditto. If I encode it in a different format (e.g. bits on magnetic media, or weights in a model), it still includes a duplicate.

On the face of it, OpenAI, Hugging Face, Anthropic, Google, and all the other companies are breaking copyright law as written.

Usually, when reality and law diverge, the law eventually shifts, not reality. Personally, I'm not a big fan of copyright law as written. We should have a discussion about what it should look like. That's a big discussion. I'll make a few claims:

- We no longer need to encourage technological progress; it's moving fast enough. If anything, slowing it down makes sense.

- "Fair use" is increasingly vague in an era where I can use AI to take your picture, tweak it, and reproduce an altered version in seconds.

- Transparency is increasingly important as technology defines the world around us. If the TikTok algorithm controls elections, and Google analyzes my data, it's important that I know what those are.

That's the bigger discussion to have.
basch | about 1 month ago

"I think the unambiguous answer to this question is that the act of training is viewing and analysis, not copying. There is no particular copy of the work (or any copyrightable elements) stored in the model. While some models are capable of producing work similar to their inputs, this isn't their intended function, and that ability is instead an effect of their general utility. Models use input work as the subject of analysis, but they only "keep" the understanding created, not the original work."

The author just seems to have decided the answer and worked backwards, when in reality this is very much a ship of Theseus type problem. At what point does a compressed JPEG stop being the original image and become a transformation? The same thing applies here. If I ask a model to recite Frankenstein and it largely does, is that not a lossy compression of the original? Would the author argue an MP3 isn't a copy of a song because not all the information is there?

Calling it "training" instead of compression lets the author play semantic games.
TimorousBestie | about 1 month ago

The assumption that human learning and "machine learning" are somehow equivalent (in a physical, ethical, or legal sense; the domain shifts throughout the essay) is not supported with evidence here. They spend a long time describing how machine learning is *different* from human learning on a computational level, but that doesn't seem to impact the rest of the argument.

I wish AI proponents would use the plain meaning of words in their persuasive arguments, instead of muddying the waters with anthropomorphic metaphors that smuggle in the conclusion.
hyperman1 | about 1 month ago

In the EUCD, a copy in RAM falls under copyright, but there is an exception (Art. 5) if the copy is transitory and the target use is legal under copyright. Neither is true for AI, so this article is probably wrong in the EU.

Apart from that, I wonder if an AI is "learning" in the legal sense of the word. I'd suspect that removing copyright through learning is something only humans can do, seen through legal glasses. An AI would be a mechanical device creating a mashup of multiple works, and be a derived work of all of them.

The main problem with this rebuttal is proving the AI copied your work specifically, and finding out which of the zillions of creative works in that mashup are owned by whom.
gavinhoward | about 1 month ago

Copyright reserves most rights to the author by default, and copyright laws thought about future changes.

Copyright law (in the US) added fair use, which has four tests. Not all of the tests need to fail for fair use to disappear; usually two are enough.

The one courts love the most is whether the copy is used to create something *commercial* that competes with the original work.

From near the top of the article:

> I agree that the dynamic of corporations making for-profit tools using previously published material to directly compete with the original authors, especially when that work was published freely, is "bad."

So essentially, the author admits that AI fails this test.

Thus, if authors can show the AI fails another test (and AI usually fails the substantive difference test), AI is copyright infringement. Period.

The fact that the article gives up that point so early makes me feel I would be wasting time reading more, but I will still do it.

Edit: still reading, but the author talks about enumerated rights. Most lawsuits target the distribution of model outputs because that is *reproduction*, an enumerated right.

Edit 2: the author talks about substantive differences, admits they happen about 2% of the time, but then seems to argue that means they are not infringing at all. No, they are infringing in those instances.

Edit 3: the author claims that model users are the infringing ones, but at least one AI company (Microsoft?) has agreed to indemnify users, so plaintiffs have every right to go after the company instead.
djoldman | about 1 month ago

There are a few stages involved in delivering the output of an LLM or text-to-image model:

1. acquire training data
2. train on the training data
3. run inference on the trained model
4. deliver the outputs of inference

One can subdivide the above however one likes.

My understanding is that most lawsuits are targeting stage 4, delivering the outputs of inference.

This is presumably because it has the best chance of resulting in a verdict favorable to the plaintiff.

The issue of whether or not it's legal to train on data to which one does not hold copyright is probably moot: businesses don't care too much about what you do unless you're making money off it.
EdwardDiego | about 1 month ago

That's a lot of words to justify what I presume to be the author's pre-existing viewpoint.

Given that "training" on someone else's IP will lead to a regurgitation of some slight permutation of that IP (e.g., all the Studio Ghibli style AI images), I think the author is pushing shit uphill with the word "can't".
ConspiracyFact | 29 days ago

The problem is that model outputs are wholly derivative. This is easy to see if you start with a dataset of one artistic work and add additional works one at a time. Clearly, at the start the outputs are derivative. As more inputs are added, there's no magical transformation from derivative to non-derivative at any particular point. The output is always a deterministic function of the inputs, or a deterministic output papered over with randomness.

"But," you say, "human art is derivative too in that case!"

No. A human artist is influenced by other artists, yes, but he is also influenced by *the totality of his life experience*, which amounts to much more in terms of "inputs".
prophesi | about 1 month ago

I think it can be IP theft, and it would also require labor negotiations. And global technical infrastructure for people to opt in to having their data trained on. And a method for creators to be compensated if they do opt in and their work is ingested. And ways for the datasets to be audited by third parties.

It sounds like a pipe dream, but ethical enforcement of AI training across the globe will require multifaceted solutions that still won't stamp out all bad actors.
light_hue_1 | about 1 month ago

This is totally the wrong analysis.

Think of AI tools like any other tools. If I include code I'm not allowed to use, like reading a book I pirated, that's copyright infringement. If I include an image as an example in my image editor, that's OK if I am allowed to copy it.

If someone decides to use my image editor to create an image that's copyrighted or trademarked, that's not the fault of the software. Even if my software says "hey look, here are some cool logos that you might want to draw inspiration from".

People are getting too hung up on the AI part. That's irrelevant.

This is just software. You need a license for the inputs, and if the output is copyrighted, that's on the user of the software. It's a significant risk of just using these models carelessly.
alganet | about 1 month ago

That's a lot of text.

Where is AI disruptive? If it is disruptive in some area, should we apply old precedents to something so radically new? (Rhetorical.)

Good fresh training data _will end_. The entire world can't feed this machine as fast as it "learns".

To make a farming comparison, it's eating the seeds. Any new content gets devoured before it has a chance to grow and bear fruit. Furthermore, people are starting to manipulate the model instead of just creating good content. What exactly will we learn then? No one fucking knows. It's a power-grab free-for-all waiting to happen. Whoever is poor in compute resources will lose (people! the majority of us).

If I am right, we will start seeing anemic LLMs soon. They will get worse with more training, not better. Of course they will still be useful, but not as a liberating learning tool.

Let's hope I am not right.
bionhoward | about 1 month ago

Did the article mention the part about how these companies turn around and say you're not allowed to use the output to develop competing models? I couldn't find any mention of this.
Calwestjobs | about 1 month ago

Look, the quickest test of whether it IS or IS NOT IP theft: go to any image-generation ML wizardry prompt machine and ask it this:

"generate image of jack ryan investigating nuclear bomb. he has to look like morgan freeman."

(And do it quickly, before someone at FAANGM manually plays with something to alter the result of that prompt.)

The problem is the opposite one: is the "original" work's IP original in itself, or is it just a remix? Or did someone just give a lawyer some generic text and make it arbitrarily protected for adding 0.000000001% to a previous work?
EPWN3D | about 1 month ago

I couldn't get through it. Did he actually make an argument eventually?
hulitu | about 1 month ago

> Why training AI can't be IP theft

Because Microsoft is part of the BSA. /s

If you steal our software, it is theft. If we steal your software, it is fair use. Can we train AI on leaked Windows source code?
techpineapple | about 1 month ago

"If humans were somehow required to have an explicit license to learn from work, it would be the end of individual creativity as we know it"

What about textbooks? In order to train on a textbook, I have to pay a licensing fee.
re-thc | about 1 month ago

The argument in the article breaks down by taking marketing terms literally and trying to apply them to a technical argument.

You might as well start by claiming that the "cloud" is some computers that really float in the sky. Does AWS rain?

This "AI", or rather this program, is not "training" or "learning", at least not in the way the humans who conceived these laws anticipated or intended. It doesn't fit the usual dictionary meaning of training or learning. If it did, we'd have real AI, i.e. what's now termed AGI.