TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

AI-Generated Data Can Poison Future AI Models

147 points by meany, about 1 year ago

22 comments

sophrocyne, about 1 year ago
Some perspectives from someone working in the image space.

These tests don't feel practical. That is, they seem intended to collapse the model, not demonstrate "in the wild" performance.

The assumption is that all content is black or white - AI or not AI - and that you treat all content as equally worth retraining on.

It offers no room for assumptions around data augmentation, human-guided quality discrimination, or anything else that might alter the set of outputs to mitigate the "poison".
nestorD, about 1 year ago
I believe that this is a non-problem pushed forward by small-scale experiments that are not representative of what people actually do with AI generation. A lot of new content, while AI generated, has been hand picked and polished by a human (for example, while you might commit AI generated code to your codebase, you ensure that it is correct and follows your preferred style). Content farms will push gibberish out, but they did so, and worse, before and the first generation of models was able to train on the internet anyway.
add-sub-mul-div, about 1 year ago
You'd think we'd be concerned about it poisoning the culture, well before any concerns that it would start to interfere with the rich continuing to be able to profit from it doing so.
beeboobaa, about 1 year ago
It shouldn't be a problem if you only train on legally acquired data. You will know the author's name and can contact them if you so wish.
buo, about 1 year ago
I think it's interesting that human minds generally (though not always!) improve when exposed to the output of other human minds. It seems to be the opposite for current LLMs.
ipython, about 1 year ago
This reminds me of how fascinated I was as a kid by the artifacts you get from recursively photocopying a piece of paper.
randcraw, about 1 year ago
It's fascinating that error can accumulate through repeated trainings in a way that 1) is undetected by humans and 2) can degrade LLM or diffusion models (or any transformer model?) so completely. This implies that not only do we not understand how latent knowledge is actually represented in deep nets, we also don't know how it forms or how it changes during training. If we did, we could have predicted the destructive impact of recycling output as input. IMO, this suggests we should demand rigorous validation of deep nets (especially generative ones) before relying on them to behave responsibly.
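The error accumulation described above can be illustrated with a toy experiment (a minimal sketch under simplified assumptions, not the article's actual setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Estimation noise compounds across generations and the variance collapses, the statistical analogue of recursive photocopying.

```python
import random
import statistics

def fit(samples):
    # Maximum-likelihood Gaussian fit: sample mean and (biased) std dev
    return statistics.fmean(samples), statistics.pstdev(samples)

random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
for generation in range(300):
    # Train each new "model" only on the previous model's output
    samples = [random.gauss(mu, sigma) for _ in range(10)]
    mu, sigma = fit(samples)

print(f"std dev after 300 generations: {sigma:.6f}")  # collapses toward 0
```

With only 10 samples per generation the downward drift in variance dominates quickly; larger sample sizes slow the collapse but do not prevent it.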
coldcode, about 1 year ago
I think AI-generated images are worse for training AI generative models than LLMs, since there are so many now on the internet (see Instagram art related hashtags if you want to see nothing but AI art) compared to the quantity of images downloaded prior to 2021 (for those AI that did that). Text will always be more varied than seeing 10m versions of the same ideas that people make for fun. AI text can also be partial (like AI-assisted writing) but the images will all be essentially 100% generated.
esafak, about 1 year ago
Computers need to be able to learn from the world at large, not just their own output. World models are needed to make progress.
ein0p, about 1 year ago
I also wonder what search engines are going to do about all this. It sounds to me like traditional, non-intelligent search might actually be on its way out, although of course it'll take time. Future search engines will have to be quite adept at trying to figure out whether the text they index is bullshit or not.
doubloon, about 1 year ago
Reminds me of sheep and cows being fed their brethren's own brain matter developing spongiform encephalopathy (brain disease), or of course cannibals developing kuru. Except a purely 'software' form.
p5v, about 1 year ago
Is there a standard objective metric that can help determine that the quality of a model has degraded over time? In that case, much like source code, you just revert to the old version.
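There are such metrics: for language models, perplexity on a held-out evaluation set is the usual one, and it can be tracked across checkpoints like any regression test. A minimal sketch (the per-token log-probabilities below are invented for illustration; a real pipeline would obtain them from the model on a fixed eval corpus):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(-mean log-likelihood per token); lower is better
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs from an old and a new model checkpoint
old_checkpoint = [-1.2, -0.8, -2.0, -0.5]
new_checkpoint = [-1.9, -1.4, -2.6, -1.1]

old_ppl = perplexity(old_checkpoint)
new_ppl = perplexity(new_checkpoint)
if new_ppl > old_ppl:
    print(f"regression: perplexity rose {old_ppl:.2f} -> {new_ppl:.2f}")
```

The "revert to the old version" step then works exactly as the comment suggests: keep the checkpoint whose held-out perplexity was lower.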
GaggiX, about 1 year ago
Unless the internet is no longer useful because there is no way to find anything reliable, there would be enough signal to train and align models.
Der_Einzige, about 1 year ago
Certain words, like "groundbreaking", have been totally ruined for me by LLMs which are too often trained to sound like each other.
Bjorkbat, about 1 year ago
I'm not sure how much of a risk this is to LLMs in particular, but I feel like we're already seeing the impact on image AI models.

Even though they're getting better at generating hands that make sense and other fine details, you can generally tell that an image is AI generated because it has a certain "style". Can't help but wonder if this is partly due to generated images contaminating the training data and causing subsequent AI image generators to stylistically converge over time.
RecycledEle, about 1 year ago
Synthetic data is a disaster.

If you want foom (fast self-improvement in AI), use AIs to filter the training data for the next generation of AIs.
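That filtering step might look like the sketch below. The judge function and the threshold are hypothetical stand-ins (a real pipeline would use a learned quality classifier or a reward model, not a length heuristic):

```python
def filter_training_data(samples, judge, threshold=0.7):
    """Keep only samples whose judge-assigned quality score clears the threshold."""
    return [s for s in samples if judge(s) >= threshold]

# Toy judge: longer texts score higher, as a stand-in for a learned quality model
def toy_judge(text):
    return min(len(text) / 40, 1.0)

data = ["ok", "a substantially longer, more informative sample of text"]
kept = filter_training_data(data, toy_judge)
print(kept)  # only the long sample survives the filter
```

Whether this produces improvement or merely amplifies the judge model's own biases is exactly the open question the surrounding thread is debating.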
cortesoft, about 1 year ago
Human created content is also filled with gibberish and false information and random noise… how is AI generated content worse?
ur-whale, about 1 year ago
> AI-Generated Data Can Poison Future AI Models

Looks like we didn't learn anything from the mad cow disease!
jxdxbx, about 1 year ago
How does this relate to synthetic data?
richk449, about 1 year ago
Kessler syndrome for the internet?
chmike, about 1 year ago
And human-generated data may not?
hermitcrab, about 1 year ago
See also: https://news.ycombinator.com/item?id=39422528