TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

AI-Generated Data Can Poison Future AI Models

147 points by meany, about 1 year ago

22 comments

sophrocyne, about 1 year ago
Some perspectives from someone working in the image space.

These tests don't feel practical. That is, they seem intended to collapse the model, not demonstrate "in the wild" performance.

The assumption is that all content is black or white - AI or not AI - and that you treat all content as equally worth retraining on.

It offers no room for assumptions around data augmentation, human-guided quality discrimination, or anything else that might alter the set of outputs to mitigate the "poison".
nestorD, about 1 year ago
I believe that this is a non-problem pushed forward by small-scale experiments that are not representative of what people actually do with AI generation. A lot of new content, while AI generated, has been hand picked and polished by a human (for example, while you might commit AI generated code to your codebase, you ensure that it is correct and follows your preferred style). Content farms will push gibberish out, but they did so, and worse, before and the first generation of models was able to train on the internet anyway.
add-sub-mul-div, about 1 year ago
You'd think we'd be concerned about it poisoning the culture, well before any concerns that it would start to interfere with the rich continuing to be able to profit from it doing so.
beeboobaa, about 1 year ago
It shouldn't be a problem if you only train on legally acquired data. You will know the author's name and can contact them if you so wish.
buo, about 1 year ago
I think it's interesting that human minds generally (though not always!) improve when exposed to the output of other human minds. It seems to be the opposite for current LLMs.
ipython, about 1 year ago
This reminds me of how fascinated I was as a kid by the artifacts you get from recursively photocopying a piece of paper.
randcraw, about 1 year ago
It's fascinating that error can accumulate through repeated trainings in a way that 1) is undetected by humans and 2) can degrade LLM or diffusion models (or any transformer model?) so completely. This implies that not only do we not understand how latent knowledge is actually represented in deep nets, we also don't know how it forms or how it changes during training. If we did, we could have predicted the destructive impact of recycling output as input. IMO, this suggests we should demand rigorous validation of deep nets (especially generative ones) before relying on them to behave responsibly.
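The error accumulation described above can be illustrated with a toy experiment (a minimal sketch under simplified assumptions, not the article's actual setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Estimation noise compounds across generations and the variance collapses, the statistical analogue of recursive photocopying.

```python
import random
import statistics

def fit(samples):
    # Maximum-likelihood Gaussian fit: sample mean and (biased) std dev
    return statistics.fmean(samples), statistics.pstdev(samples)

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for generation in range(300):
    # Train each new "model" only on the previous model's output
    samples = [random.gauss(mu, sigma) for _ in range(10)]
    mu, sigma = fit(samples)

print(f"std dev after 300 generations: {sigma:.6f}")  # collapses toward 0
```

With only 10 samples per generation the downward drift in variance dominates quickly; larger sample sizes slow the collapse but do not prevent it.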
coldcode, about 1 year ago
I think AI-generated images are worse for training AI generative models than LLMs, since there are so many now on the internet (see Instagram art related hashtags if you want to see nothing but AI art) compared to the quantity of images downloaded prior to 2021 (for those AI that did that). Text will always be more varied than seeing 10m versions of the same ideas that people make for fun. AI text can also be partial (like AI-assisted writing) but the images will all be essentially 100% generated.
esafak, about 1 year ago
Computers need to be able to learn from the world at large, not just their own output. World models are needed to make progress.
ein0p, about 1 year ago
I also wonder what search engines are going to do about all this. It sounds to me like traditional, non-intelligent search might actually be on its way out, although of course it'll take time. Future search engines will have to be quite adept at trying to figure out whether the text they index is bullshit or not.
doubloon, about 1 year ago
Reminds me of sheep and cows being fed their brethren's own brain matter developing spongiform encephalopathy (brain disease), or of course cannibals developing kuru. Except a purely 'software' form.
p5v, about 1 year ago
Is there a standard objective metric that can help determine that the quality of a model has degraded over time? In that case, much like source code, you just revert to the old version.
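There are such metrics: for language models, perplexity on a held-out evaluation set is the usual one, and it can be tracked across checkpoints like any regression test. A minimal sketch (the per-token log-probabilities below are invented for illustration; a real pipeline would obtain them from the model on a fixed eval corpus):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(-mean log-likelihood per token); lower is better
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from an old and a new model checkpoint
old_checkpoint = [-1.2, -0.8, -2.0, -0.5]
new_checkpoint = [-1.9, -1.4, -2.6, -1.1]

old_ppl = perplexity(old_checkpoint)
new_ppl = perplexity(new_checkpoint)
if new_ppl > old_ppl:
    print(f"regression: perplexity rose {old_ppl:.2f} -> {new_ppl:.2f}")
```

The "revert to the old version" step then works exactly as the comment suggests: keep the checkpoint whose held-out perplexity was lower.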
GaggiX, about 1 year ago
Unless the internet is no longer useful because there is no way to find anything reliable, there would be enough signal to train and align models.
Der_Einzige, about 1 year ago
Certain words, like "groundbreaking", have been totally ruined for me by LLMs which are too often trained to sound like each other.
Bjorkbat, about 1 year ago
I'm not sure how much of a risk this is to LLMs in particular, but I feel like we're already seeing the impact on image AI models.

Even though they're getting better at generating hands that make sense and other fine details, you can generally tell that an image is AI generated because it has a certain "style". Can't help but wonder if this is partly due to generated images contaminating the training data and causing subsequent AI image generators to stylistically converge over time.
RecycledEle, about 1 year ago
Synthetic data is a disaster.

If you want foom (fast self-improvement in AI), use AIs to filter the training data for the next generation of AIs.
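That filtering step might look like the sketch below. The judge function and the threshold are hypothetical stand-ins (a real pipeline would use a learned quality classifier or a reward model, not a length heuristic):

```python
def filter_training_data(samples, judge, threshold=0.7):
    """Keep only samples whose judge-assigned quality score clears the threshold."""
    return [s for s in samples if judge(s) >= threshold]

# Toy judge: longer texts score higher, as a stand-in for a learned quality model
def toy_judge(text):
    return min(len(text) / 40, 1.0)

data = ["ok", "a substantially longer, more informative sample of text"]
kept = filter_training_data(data, toy_judge)
print(kept)  # only the long sample survives the filter
```

Whether this produces improvement or merely amplifies the judge model's own biases is exactly the open question the surrounding thread is debating.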
cortesoft, about 1 year ago
Human created content is also filled with gibberish and false information and random noise… how is AI generated content worse?
ur-whale, about 1 year ago
> AI-Generated Data Can Poison Future AI Models

Looks like we didn't learn anything from the mad cow disease!
jxdxbx, about 1 year ago
How does this relate to synthetic data?
richk449, about 1 year ago
Kessler syndrome for the internet?
chmike, about 1 year ago
And human-generated data may not?
hermitcrab, about 1 year ago
See also: https://news.ycombinator.com/item?id=39422528