Some perspectives from someone working in the image space.

These tests don't feel practical; that is, they seem designed to collapse the model, not to demonstrate "in the wild" performance.

The assumption is that all content is black or white - AI or not AI - and that every piece of content is equally worth retraining on.

The setup leaves no room for data augmentation, human-guided quality discrimination, or anything else that might alter the set of outputs and mitigate the "poison".
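As a rough illustration of what I mean by quality discrimination, here's a sketch with a hypothetical `quality_score`, which in practice might be human review, a learned critic/reward model, or passing tests:

```python
# Sketch of the curation step these experiments assume away: only synthetic
# samples that clear a quality bar are mixed back in, and the synthetic
# share of the training set is capped. `quality_score` is a stand-in for
# human review, a learned discriminator, passing tests, etc.
def build_training_set(human_data, synthetic_data, quality_score,
                       threshold=0.8, max_synthetic_fraction=0.3):
    kept = [x for x in synthetic_data if quality_score(x) >= threshold]
    cap = int(max_synthetic_fraction * len(human_data))  # cap synthetic share
    return list(human_data) + kept[:cap]
```

Even a crude filter like this changes the setup these collapse experiments assume, where every generated sample is fed back in unexamined.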
I believe that this is a non-problem pushed forward by small-scale experiments that are not representative of what people actually do with AI generation.
A lot of new content, while AI-generated, has been hand-picked and polished by a human (for example, while you might commit AI-generated code to your codebase, you ensure that it is correct and follows your preferred style).
Content farms will push out gibberish, but they were doing that (and worse) before, and the first generation of models was able to train on the internet anyway.
You'd think we'd be concerned about it poisoning the culture well before we got concerned that it might interfere with the rich continuing to profit from that poisoning.
I think it's interesting that human minds generally (though not always!) improve when exposed to the output of other human minds. It seems to be the opposite for current LLMs.
It's fascinating that error can accumulate through repeated training in a way that 1) goes undetected by humans and 2) degrades LLMs or diffusion models (or any transformer model?) so completely. This implies that not only do we not understand how latent knowledge is actually represented in deep nets, we don't know how it forms or how it changes during training. If we did, we could have predicted the destructive impact of recycling output as input. IMO, this suggests we should demand rigorous validation of deep nets (especially generative ones) before relying on them to behave responsibly.
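A minimal toy sketch of that recycling (my own illustration, not the paper's experiment): repeatedly refit a Gaussian to a finite sample drawn from the previous generation's fit. The sampling error compounds, the tails get clipped first, and the fitted distribution eventually collapses.

```python
# Toy sketch of "recycling output as input": each generation fits a Gaussian
# to a finite sample drawn from the previous generation's fit. Over enough
# generations, sigma collapses toward zero.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0      # generation 0: the "real" data distribution
n = 50                    # finite training sample per generation

for gen in range(1, 1001):
    sample = rng.normal(mu, sigma, n)        # generate from the current model
    mu, sigma = sample.mean(), sample.std()  # "retrain" by refitting on that output
    if gen % 200 == 0:
        print(f"gen {gen:4d}: mu={mu:+.4f}  sigma={sigma:.4f}")
```

Real training is obviously nothing like refitting a Gaussian, but the mechanism - finite samples from the last generation becoming the ground truth for the next - is the part the collapse results point at.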
I think AI-generated images are a bigger problem for training generative image models than AI-generated text is for LLMs, since there are so many of them on the internet now (see Instagram's art-related hashtags if you want to see nothing but AI art) relative to the quantity of images scraped prior to 2021 (for the models that did that). Text will always be more varied than 10m versions of the same few ideas that people generate for fun. AI text can also be partial (as in AI-assisted writing), but the images will all be essentially 100% generated.
I also wonder what search engines are going to do about all this. Actually, it sounds to me like traditional, non-intelligent search might be on its way out, although of course that will take time. Future search engines will have to be quite adept at figuring out whether the text they index is bullshit or not.
Reminds me of sheep and cows developing spongiform encephalopathy (brain disease) after being fed their brethren's brain matter, or of course cannibals developing kuru - except in a purely 'software' form.
Is there a standard, objective metric that can tell you when the quality of a model has degraded over time? If so, then much like with source code, you could just revert to the old version.
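One possible shape for that check (a sketch, not a standard practice; `model.nll` is a hypothetical interface): score each new checkpoint on a frozen, human-written held-out set and refuse to promote it if it regresses.

```python
import math

def perplexity(model, heldout_texts):
    """Mean perplexity over a frozen, human-written evaluation set.

    Assumes a hypothetical interface: model.nll(text) returns the total
    negative log-likelihood (in nats) and the token count for `text`.
    """
    total_nll, total_tokens = 0.0, 0
    for text in heldout_texts:
        nll, n_tokens = model.nll(text)
        total_nll += nll
        total_tokens += n_tokens
    return math.exp(total_nll / total_tokens)

def should_promote(new_model, old_model, heldout_texts, tolerance=1.02):
    """Keep the old checkpoint unless the new one is within ~2% of it,
    much like refusing to merge a change that fails a benchmark."""
    return perplexity(new_model, heldout_texts) <= tolerance * perplexity(old_model, heldout_texts)
```

The catch is that perplexity on a fixed set only catches some kinds of degradation; loss of diversity or tail knowledge needs its own metrics.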
I'm not sure how much of a risk this is to LLMs in particular, but I feel like we're already seeing the impact on AI image models.

Even though they're getting better at generating hands that make sense and other fine details, you can generally tell that an image is AI-generated because it has a certain "style". I can't help but wonder if this is partly due to generated images contaminating the training data and causing subsequent AI image generators to stylistically converge over time.