
AI has poisoned its own well

102 points by serhack_ almost 2 years ago

13 comments

sogen almost 2 years ago
An aspect never mentioned is that this data is fed only by active users, who are a minority of people.

The majority of online users are lurkers, so all of these models are heavily biased toward the small subset of people they got their information from.
raxxorraxor almost 2 years ago
If the generated content is vetted by an adversarial network, we theoretically have a recursively improving AI. Perhaps by introducing more randomness we could even reach some form of evolution that can introduce new concepts, since limited scope is still what gives AIs away in the end. That is the optimistic perspective, at least.

On the net, search and content quality is already pretty low for certain keywords. If a word is part of the news cycle, expect hundreds of badly researched newspaper articles, some of which might be generated as well. Or if they weren't, you wouldn't notice a difference if they became so.

But I don't believe the companies made a mistake. They could even protect their position with the data they have already acquired and classified. Maybe a quality label would say "genuine human®".

If all that fails, large companies would also be able to employ thousands of low-wage workers to classify new content. The growing memory problem persists; I think this is a race where the model that can extract data as efficiently as possible will win. But without the data sets, there is no way to verify performance.
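A toy sketch of that adversarial-vetting idea, assuming you already have a labeled sample of human vs. generated text (the training corpus, labels, and threshold below are illustrative, not something from this thread):

    # Toy "adversarial vetting": a discriminator trained to tell human
    # text from generated text, used to gate what enters a training pool.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "rambling hand-written forum post with typos and in-jokes",
        "Top 10 Best Gadgets 2023 - The Ultimate Guide You Need",
    ]
    train_labels = [1, 0]  # 1 = human-made, 0 = generated

    discriminator = make_pipeline(TfidfVectorizer(), LogisticRegression())
    discriminator.fit(train_texts, train_labels)

    def vet(candidates, min_p_human=0.9):
        """Keep only items the discriminator is confident are human-made."""
        p_human = discriminator.predict_proba(candidates)[:, 1]
        return [c for c, p in zip(candidates, p_human) if p >= min_p_human]

In a full GAN-style loop the generator would also be trained against this discriminator; here it only filters the candidate data.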
Arnt almost 2 years ago
The article seems to build on several foundations, but one is critical: that the AIs need enormous amounts of data, and that that data has to come from a pool that people can now poison with AI-generated data. If that doesn't hold, nothing in the article holds.

And it doesn't seem clear to me. It may be true, but it's far from obvious.

For example, DALL-E, which was better than its predecessors. Was it better because of more input data, that is, was more input data a necessary condition for its improvement? Or even the biggest reason? Reading the OpenAI blog makes it sound as if they had new kinds of AI models, new ideas about models, and that the use of data from a web crawler was little more than a cost optimisation. If that's true, then it should be possible to build another generation of image AI by combining more new insights with, say, the picture archives of Reuters and other companies with archives of known provenance.

Maybe I'm an elitist snob, but the idea that you can generate amazing pictures using the Reuters archive sounds more plausible than that you could do the same using a picture archive from all the world's SEO-spam pages. SEO spam just doesn't look intelligent or amazing.
Joker_vD almost 2 years ago
> too many people have pumped the internet full of mediocre generated content with no indication of provenance

I don't know about the text models, but e.g. Stable Diffusion (and most of its derived checkpoints) has a *very* recognizable look.

By the way, does anyone know if such generative models could be used as classifiers, answering the "what's the probability that this input was generated by this model" question? That would help solve the "obtaining quality training data" problem: use the data that has a low probability of having been generated by any of the most popular models. It's not as if people have started producing less hand-made content!
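For autoregressive text models this is possible in principle, since they assign an exact likelihood to any string: score candidate training data by its perplexity under a popular public model and prefer the "surprising" (high-perplexity) data. A minimal sketch, assuming GPT-2 as the reference model and an uncalibrated threshold:

    # Score text by its perplexity under a reference LM; unusually low
    # perplexity hints the text may be machine-generated. Model choice
    # and threshold are illustrative and would need real calibration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # labels=ids makes the model return mean next-token cross-entropy
            loss = model(input_ids=ids, labels=ids).loss
        return torch.exp(loss).item()

    def looks_hand_made(text: str, threshold: float = 40.0) -> bool:
        return perplexity(text) > threshold

Perplexity-style signals are what several AI-text detectors lean on; diffusion models like Stable Diffusion don't expose likelihoods as cheaply, which is part of why image detection relies on visual artifacts instead.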
mark_l_watson almost 2 years ago
I am not yet convinced that generated data "poisons the well" if there is some aspect of adversarial training.

About 7 years ago, when I managed a deep learning team at Capital One, I did a simple experiment of training a GAN to generate synthetic spreadsheet data. The generated data maintained feature statistics and correlations between features. Classification models trained on the synthetic data had high accuracy when tested on real data. A few people who worked for me took this idea and built an awesome system out of it.

Since the poisoned well is a known thing now, it seems like a solvable problem.
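The check described there is often called "train on synthetic, test on real" (TSTR). A sketch of that evaluation, where `sample_synthetic` is a hypothetical stand-in for the trained GAN's sampling function:

    # TSTR: fit a classifier on GAN output and compare it, on held-out
    # real rows, against a classifier fit on real rows.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def tstr(X_real, y_real, sample_synthetic, n_synth=10_000):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_real, y_real, test_size=0.3, random_state=0)
        X_syn, y_syn = sample_synthetic(n_synth)  # draw from the GAN
        on_synth = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)
        on_real = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        # If the two accuracies are close, the synthetic data preserved
        # the feature statistics the classifier needs.
        return (accuracy_score(y_te, on_synth.predict(X_te)),
                accuracy_score(y_te, on_real.predict(X_te)))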
fnordpiglet almost 2 years ago
Humans have generated much more content to date than these models are being trained on. Facebook has enormous amounts of human-to-human interactions at its disposal, and presumably will continue collecting more and more. Likewise, there will exist forums where humans write to other humans, like this one, regardless of the pervasiveness of spam on the internet. Finally, most LLMs are trained off curated data sets that do not encompass all written text. The process of curating the dataset is necessarily constraining, which means the data admitted so far must be much smaller than the data possible to admit. These analyses also assume we've reached a fixed point in the algorithmic ability of these models to converge.

I think the truth is we've written all that ever needs to be written, and even if the universe becomes populated by only AI LLM chatbots communicating, they will be fine to feast off of what we've left them as a legacy.
cainxinth almost 2 years ago
I recently contributed to the LLM "low-background steel" problem. I have a blog where I share Wikipedia articles. I wanted to do one on a wiki article for a topic related to AI, but the article I chose was thin on details. So instead I asked GPT-4 for the info. I edited the output, found scholarly citations for just about every sentence with Google, and then added it to the wiki page.

I'm a copywriter (I know, my days are numbered), so I didn't just plop the generated text in unchanged, but it's still primarily GPT's content. I did a nice job, imho, and improved the article greatly. It's been up for several days now, so I think it has a good chance of staying long term.

Still, it's a funny situation. On the one hand, I did something I've done many times before: researched a topic and added to a wiki article. I always feel gratified when I contribute to Wikipedia. I'm adding to the sum of human knowledge in an incredibly direct way.

But the information I added this time will be used to train future LLMs and thus "poison the well" with generated content. The verdict is still out on just how bad generated content will be for training new models. But I definitely feel slightly conflicted about whether I did something that is a net positive or negative.
thaw13579 almost 2 years ago
I don't think it's so cut and dried. The article paints a somewhat simplistic picture of how the best-performing LLMs work. The unsupervised pre-trained networks are indeed data hungry, but the secondary supervised learning stages can actually get by with a far smaller set of highly curated prompt-response data, e.g. LIMA (https://arxiv.org/abs/2305.11206).

Another factor is that generated data distributed online may be quite high quality (because people find it interesting enough to share), so it's plausible this could actually improve model performance. Some LLMs have been trained on data from other models with good results, e.g. supposedly Bard with GPT-4 prompt-response pairs, and GPT-4 with Whisper transcripts of YouTube (https://twitter.com/amir/status/1641219919202361344/photo/1). Of course, there could be trolling or misinformation that "poisons" the data, and that is a problem (whether synthetic or organic)!
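A minimal sketch of that second stage, with a small GPT-2 standing in for the pre-trained network and a curated pair standing in for the ~1,000 examples LIMA used (model, data, and hyperparameters are all illustrative assumptions):

    # LIMA-style supervised fine-tuning: a pre-trained causal LM plus a
    # small, curated prompt-response set.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    curated_pairs = [
        ("Explain overfitting in one sentence.",
         "Overfitting is when a model memorizes its training data "
         "instead of learning patterns that generalize."),
    ]

    model.train()
    for epoch in range(3):
        for prompt, response in curated_pairs:
            text = f"{prompt}\n{response}{tokenizer.eos_token}"
            ids = tokenizer(text, return_tensors="pt").input_ids
            loss = model(input_ids=ids, labels=ids).loss  # next-token loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()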
achrono almost 2 years ago
Another problem with this take is that it's giving humans too much credit here -- there's a vast continuum from pure human creation to pure AI creation.

Human beings are given to (1) repeating tropes, (2) arguing incoherently, (3) missing the point, etc.

Think of all the books and movies from before 2023. Isn't there a lot that is formulaically wrong/misleading/suboptimal in there?

So this might not really be "poisoning the well" -- a more interesting area to look at would be: how can we make GPT-n aware of its own gaps and *use* this knowledge, rather than just let the user know that it knows its gaps?
helen___keller almost 2 years ago
I'm doubtful about the effort people will take to protect their content from harvesting.

Creative content has been continuously devalued for decades now. The whole reason you can find so much music, creative writing, and art available for free online is that this content is essentially worthless until you can build a brand or clientele to monetize it, and the only way to do that is to broadcast it for free to as many people as possible.
blibble almost 2 years ago
I certainly replaced my highly starred projects on GitHub with randomly generated crap (build passes!) when they announced Copilot.
Borrible almost 2 years ago
Ah, the Ouroboros Language Problem.
more_corn almost 2 years ago
This is a common but misunderstood concern. It’s one of those worries that seems sound in theory, but practice doesn’t bear it out. Remember when people were up in arms about SSD wear cycles? Yeah that’s not actually the way they fail. There are real problems with AI. This is not one of them.