Ask HN: DALL-E was trained on watermarked stock images?

266 points by whycombinetor over 2 years ago

I just got a DALL-E render with a very intact "gettyimages" watermark on it. I'm no legal expert on whether you have to own the license to something to use it as training input to your AI model, but surely you can't just... use stock photos without paying for the license? Maybe I'm just old fashioned.

Prompt: "king of belgium giving a speech to an audience, but the audience members are cucumbers"

All 4 results (all no good as far as the prompt is concerned): https://ibb.co/gz5RDkB

Fullsize of the one with the watermark: https://ibb.co/DzGR063

44 comments

dlg over 2 years ago

I am not a lawyer, but I've had to argue about copyright with several.

In the United States, there are two bits of case law that are widely cited and relevant: Kelly v. Arriba Soft Corp (9th) found that making thumbnails of images for use in a search engine was sufficiently "transformative" that it was ok. Another case, Perfect 10 (9th), found that thumbnails for image search and cached pages were also transformative.

OTOH, cases like Infinity Broad. Corp. v. Kirkwood found that retransmission of radio broadcasts over telephone lines is not transformative.

If I understand correctly, there are four parts to the US courts' test for transformativeness within fair use: (1) character of use, (2) creative nature of the work, (3) amount or substantiality of copying, (4) market harm.

I'd think that training a neural network on artwork--including copyrighted stock photos--is almost certainly transformative. However, as you show, a neural network might be overtrained on a specific image and reproduce it too perfectly--that image probably wouldn't fall under fair use.

There are also questions of whether they violated the CFAA or some agreement when crawling the images (but hiQ v. LinkedIn makes it seem like it's very possible to do legally) and whether they reproduced Getty's logo in a way that violates trademarks (are they trying to use it in trade in a way that could cause confusion, though?)

chrismorgan over 2 years ago

All large-scale public machine learning stuff depends on being exempt from copyright restrictions, under fair use doctrine. Look at my responses to all of the threads about Copilot + GPL for more info about that application of it: https://hn.algolia.com/?query=chrismorgan+copilot+gpl&type=comment

When that is finally tried in court, if it fails to any meaningful extent at all (including going all the way up to Supreme Courts as it doubtless will), then Copilot is dead, DALL·E is dead, GPT-3 is dead. All of these things will be immediately discontinued in at least the affected jurisdictions, at least until such a time as they get the laws changed or judgements overturned.

webwielder2 over 2 years ago

These are the absolute worst DALL-E images I've seen. Do people generally just share the amazing ones and most of the output is actually complete shite? Like Instagram presenting the top 1% of people's lives.

BrainVirus over 2 years ago

People here, as always, get hung up on legalese bullshit, but miss the overall picture.

The dynamic in play is highly questionable. Countless artists and photographers put effort into creating their works. They put their work online to get some attention and recognition. A company comes along, scrapes all of it and starts selling access to a model that generates something that looks highly derivative. The original cohort of artists and photographers not only get zero money or attention from this new endeavor, they are now in competition with the resulting model.

In short, someone whose work was essential to building a thing gets no benefits and possibly even gets (financially) harmed by that thing. Just because this gets verbally labeled "fair use" doesn't make it fair.

Additional point: Just a few years ago a bunch of tech companies were talking about "data dignity". Somehow, magically, this (marketing) term is no longer used anywhere.

xg15 over 2 years ago

Reminds me of the discussion about GitHub Copilot using the entirety of GitHub as training data. I was honestly baffled how many people, even experts in the field, saw use as training data as non-infringing, with the corollary that it's apparently perfectly legal to "copyright-wash" a work by feeding it to an AI and having that AI generate a slightly different but extremely similar work.

Considering how strict and heavy-handed copyright handling has been otherwise, this has added to my belief that copyright in practice is really just enforcement of the interests of whatever industry has the most power at a given time: when entertainment and content generation was the biggest revenue generator, copyright couldn't be strict enough; now all the money is on AI and suddenly loopholes the size of barn doors pop up.

ShamelessC over 2 years ago

> but surely you can't just... use stock photos without paying for the license?

They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

StillLrning123 over 2 years ago

Kids in school are also trained on stock images: https://www.reddit.com/r/KidsAreFuckingStupid/comments/8tgxsm/getty_washington/

sulam over 2 years ago

I think it’s amusing that many commenters here are perfectly willing to defend DALL-E, but mention Copilot and the discussion looks radically different.

cercatrova over 2 years ago

Based on the new scraping ruling with LinkedIn [0], anything that is "open gate" (as in, accessible without logging in) can be scraped and (I assume) be used by neural networks. The onus, it appears, is to not use it to generate copyrighted works, like Iron Man from Marvel, just as one can use Photoshop as a tool but is still barred from making and selling an Iron Man digital painting.

[0] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/17-16783.pdf

im3w1l over 2 years ago

I remember when people used to say IANAL. Innocent times when we thought there was an objective law and lawyers knew it. But that's not how these things work. The truth is that no one knows. Ultimately a bunch of people will decide how they feel about it. Well-read legal scholars trying really hard to be fair, but still just people. No one can predict with full certainty which way it will go.

jcims over 2 years ago

Legally wouldn't it just boil down to the license on the watermarked image?

BTW you can add 'royalty free' to the prompt to get rid of those most of the time (lol?).

trention over 2 years ago

My personal opinion is that it's unethical (and possibly illegal, in a subset of cases) to train models on data without explicit consent of the creators of that data. And that really encompasses all data - generative models were not a thing when said data was created and no matter how it was licensed before, explicit consent about using it for model training must be obtained from the creators themselves.

That being said, arguments about copyright are just a fig leaf as far as I am concerned. The outcome of whether this is allowed or not will depend on the net impact of using those models on the job market and whether society will be willing to tolerate it.

gojomo over 2 years ago

You may want to use the native 'Share' option, especially on the one with the watermark.

You'll get a public link, at `labs.openai.com` rather than some random image-sharing site, which will show the image & the prompt used to generate it (including a credit to "your-first-name × DALL·E").

RcouF1uZ4gsC over 2 years ago

What is interesting is a human analogy.

Say you were an artist who went to every art show and museum and studied all the art there.

If you produced a work of art solely from memory that contained large portions of other people's copyrighted art, would that still fall under copyright/require licensing?

_trampeltier over 2 years ago

If you read the licence from Getty, it says you are not allowed to use Getty pictures for ML.

userbinator over 2 years ago

This interesting era of AI will surely teach us the meaning of that old phrase "great artists steal", or more subtly rephrased, "everything is a derived work".

agnosis over 2 years ago

Got the exact same girl from the picture in the ad at the bottom. Creepy! https://ibb.co/dBLNxQ6

Geonode over 2 years ago

It doesn't matter. I could put a Getty watermark on anything. Getty would have to show that a generated image was at least in part the same as their image.

surfacedetail over 2 years ago

I'm finding it amusing that everyone immediately assumes infringement. OpenAI is a company that will not be inviting lawsuits.

We can't assume any licensing behind closed doors. My guess is that OpenAI has an agreement with Getty. Take a look at the licensing in this Observer piece: it's been licensed by Getty, which would indicate that Getty are happy with scraping.

https://www.theguardian.com/commentisfree/2022/aug/20/ai-art-artificial-intelligence-midjourney-dall-e-replacing-artists

Besides, this is not infringement in principle; the AI has been trained to think that high-quality news images have watermarks.

registeredcorn over 2 years ago

I don't care much for what laws say. If the only way someone's service can work is by ingesting the work of someone else, without compensation, and then competing with that same person, that is wrong.

If a company reverse engineers a competitor's product, they still buy the product to tear it apart and figure out how it works.

If a student learns from their teacher, then goes on to sell a similar kind of work to what their teacher makes, at least the student paid for the classes.

This arrangement offers none of that. As long as theft is illegal, this should be. I'd call it parasitic, but it isn't; this is a parasite whose sole intent is to kill the host.

coldtea over 2 years ago

> but surely you can't just... use stock photos without paying for the license?

You'd be surprised...

purpleblue over 2 years ago

Is there a copyright protection in terms of consuming a copyright-protected image? I thought it was only for the purpose of displaying that image. If you're reading the file and reading the data, but not displaying it, is that also protected?

vivegi over 2 years ago

Just wait until they build an AI watermark identifier and remover (which is a problem subset) and then use its output to train/update their model.

They probably already have specialized filtering models built to filter out censorable terms. They may be imperfect, but they are there. A watermark remover might be an easy addition.

When Stable Diffusion released their model playground, I used the prompt "Peter at the pearly gates dressed as a security guard" and got three images, two of which were censored and one that was an ordinary image. So, the capability is there already. Just a matter of time before they get good at watermark removal.

severak_cz over 2 years ago

Probably some watermarked stock photos just sneaked in.

There are lots of watermarked photos circulating on the web, for example in memes and unfinished webpages (when finished, these will be replaced with the paid variant without a watermark).

davikr over 2 years ago

Yeah, I've seen an image get generated with a very recognizable watermark for a certain stock image company. This happened with a totally unrelated prompt.

RobertoG over 2 years ago

I don't know about the images, but what about the watermark itself? Can I just take any photo and add a proprietary watermark?

sva_ over 2 years ago

Similar thing with GH Copilot. I'd say it is still fair use though, even though such things should be filtered out.

fxtentacle over 2 years ago

Yes, Imagen and everything based on LAION 400M or 2B, too.

BTW, Copilot also ignored all licenses of the source code it memorized.

Datasets are the new capital. If they could, most employees would probably also object to their company using the result of their work to replace their job. But they can't. It's the same with artists here.

JacobiX over 2 years ago

The first thing that I try after generating an image from DALL-E is a reverse image search. I do it on every image that I intend to use. More often than not, I find a very similar image; in that case I discard it and vary my prompts.

topicseed over 2 years ago

What are the best apps and subscriptions to generate these? No private beta, just sign up, put a credit card on file, and use? (Low volume, perhaps 100 images per month, so 300-500 attempts.)

Could be great for featured images for blog posts.

throwaway120983 over 2 years ago

some people will post images with watermarks on social media or other sites with user generated content. if their dataset included images scraped from them, then it could have gotten in that way

angusturner over 2 years ago

Relevant earlier discussion about this issue: https://news.ycombinator.com/item?id=32436203

inasmuch over 2 years ago

Wondered the same thing recently … https://news.ycombinator.com/item?id=31159231

davidguetta over 2 years ago

No one fucking cares. For 1 "copyrighted" image there's a thousand free with the same quality or almost.

You are wasting CO2 even discussing it

zlqanst over 2 years ago

Obviously you could send it to the copyright holder and find out. In the case of Copilot, Oracle certainly would sue.

humaniania over 2 years ago

Seems more likely to me that they add uploaded images into their data set and someone uploaded a watermarked image.

JaceLightning over 2 years ago

Educational is a fair use category. These tools advance science. I wouldn't expect them to respect copyright.

throwaway120983 over 2 years ago

sometimes people will post stock images on sites with user generated content. if their training data included images scraped from those sites, then it could have gotten in that way unintentionally

Asmod4n over 2 years ago

Last time I checked, you can source from whatever you want; legislation doesn't care.

The last time I checked was when Copilot went public; they could have trained it only on GPL code. The source license/copyright et al. don't matter.

tough over 2 years ago

So what happens if I start selling Dali-like pieces?

Cypher over 2 years ago

You transformed the original enough so it's ok

snickerbockers over 2 years ago

Regardless of whether or not training an AI on stock images violates the license, there's a very real problem with that watermark being present, which is that it proves their AI is prone to copying large swaths of images from gettyimages unaltered, and that definitely is a license violation.

This makes me think back to the controversy over GitHub Copilot; if these AIs are going to be trained on other people's IP then somebody needs to be held accountable when they commit plagiarism.

Otherwise, I'm sure Microsoft won't mind my new "gamemaker AI" that I trained on that new Halo game last year, or this "OS AI" that I trained on Windows 11.

yieldcrv over 2 years ago

some people go into business models that simply have no legal protections

ratonofx over 2 years ago

These copyright "issues" are against the true nature of innovation.

In the name of Artificial INTELLIGENCE, we must accept that a mind or intelligence is free to perceive external elements and use every stimulus to execute its own creative process.

The world is a perpetual iteration cycle amongst human beings. Good artists borrow, great artists steal.
