An aspect never mentioned is that this data comes only from active users, who are a minority of people.<p>The majority of online users are lurkers, so all of these models are heavily biased toward the kind of people they got their information from.
If the generated content is vetted by an adversarial network, we theoretically have a recursively improving AI (a toy sketch of what I mean is at the end of this comment). Perhaps by introducing more randomness we could even reach some form of evolution that introduces new concepts, since limited scope is still what gives AIs away in the end. That is the optimistic perspective, at least.<p>On the net, search and content quality is already pretty low for certain keywords. If a word is part of the news cycle, expect hundreds of badly researched newspaper articles, some of which might be generated as well; and if they weren't, you wouldn't notice the difference once they are.<p>But I don't believe the companies made a mistake. They could even protect their position with the data they have already acquired and classified. Maybe a quality label would say Genuine Human®.<p>If all that fails, large companies could also employ thousands of low-wage workers to classify new content. The growing memory problem persists; I think that is a race the model that can extract data most efficiently will win. But without the data sets, there is no way to verify performance.
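Here is that toy sketch: a generator proposes samples, a discriminator trained on real vs. generated data scores them, and only samples it can't tell from real data rejoin the training pool. Everything here (scikit-learn, the Gaussian toy data, the 0.5 cut-off) is an arbitrary stand-in, not a real pipeline:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    real_pool = rng.normal(loc=2.0, scale=1.0, size=(1000, 2))  # "human" data
    generated = rng.normal(loc=1.5, scale=1.5, size=(1000, 2))  # "AI" data

    # Train the adversarial filter to tell real from generated.
    X = np.vstack([real_pool, generated])
    y = np.array([1] * len(real_pool) + [0] * len(generated))
    disc = LogisticRegression().fit(X, y)

    # Keep only generated samples the filter thinks are probably real,
    # then fold them back into the training pool for the next round.
    realness = disc.predict_proba(generated)[:, 1]
    vetted = generated[realness > 0.5]
    augmented_pool = np.vstack([real_pool, vetted])
    print(f"kept {len(vetted)} of {len(generated)} generated samples")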
The article seems to build on several foundations, but one is critical: that AIs need enormous amounts of data, and that the data has to come from a pool that people can now poison with AI-generated data. If that doesn't hold, nothing in the article holds.<p>And it doesn't seem clear to me. It may be true, but it's far from obvious.<p>Take DALL-E, which was better than its predecessors. Was it better because of more input data, that is, was more input data a necessary condition for its improvement? Or even the biggest reason? Reading the OpenAI blog makes it sound as if they had new kinds of AI models, new ideas about models, and that the use of data from a web crawler was little more than a cost optimisation. If that's true, then it should be possible to build another generation of image AI by combining more new insights with, say, the picture archives of Reuters and other companies with archives of known provenance.<p>Maybe I'm an elitist snob, but the idea that you can generate amazing pictures using the Reuters archive sounds more plausible than the idea that you could do the same using a picture archive drawn from all the world's SEO-spam pages. SEO spam just doesn't look intelligent or amazing.
> too many people have pumped the internet full of mediocre generated content with no indication of provenance<p>I don't know about the text models, but e.g. Stable Diffusion (and most of its derived checkpoints) has a <i>very</i> recognizable look.<p>By the way, does anyone know if such generative models could be used as classifiers, answering the "what's the probability that this input was generated by this model" question? That'd help solve the "obtaining quality training data" problem: use the data that has a low probability of having been generated by any of the most popular models. It's not like people have started producing less hand-made content anyhow!
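A minimal sketch of what that filter could look like, assuming you score text by an open model's average per-token likelihood; the choice of gpt2 and the threshold are placeholders here, not a calibrated detector:

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    MODEL = "gpt2"  # stand-in for "one of the most popular models"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

    def avg_nll(text: str) -> float:
        """Average per-token negative log-likelihood of `text` under the model."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        return out.loss.item()

    def probably_human(text: str, threshold: float = 4.0) -> bool:
        # Higher NLL means the model finds the text surprising, i.e. it is
        # less likely to have generated it. The threshold is made up.
        return avg_nll(text) > threshold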
I am not yet convinced that generated data ‘poisons the well’ if there is some aspect of adversarial training.<p>About 7 years ago, when I managed a deep learning team at Capital One, I ran a simple experiment: training a GAN to generate synthetic spreadsheet data. The generated data maintained feature statistics and correlations between features, and classification models trained on the synthetic data had high accuracy when tested on real data. A few people who worked for me took this idea and built an awesome system out of it.<p>Since the poisoned well is a known thing now, it seems like a solvable problem.
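For anyone unfamiliar, the evaluation idea there (often called train-on-synthetic, test-on-real) looks roughly like the sketch below; `generate_synthetic` is a placeholder for whatever tabular generator you use, not the actual code we had:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    def tstr_score(real_X_train, real_y_train, real_X_test, real_y_test,
                   generate_synthetic):
        # Sample a synthetic copy of the training data from the generator.
        synth_X, synth_y = generate_synthetic(real_X_train, real_y_train)

        # Train a downstream classifier only on the synthetic data...
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(synth_X, synth_y)

        # ...and score it on held-out *real* data. If this comes close to a
        # model trained directly on real data, the synthetic set preserved
        # the feature statistics and correlations that matter.
        return accuracy_score(real_y_test, clf.predict(real_X_test))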
Humans have generated much more content to date than these models are being trained on. Facebook has enormous amounts of human-to-human interaction at its disposal, and will presumably continue collecting more and more. Likewise, there will exist forums where humans write to other humans, like this one, regardless of the pervasiveness of spam on the internet. Finally, most LLMs are trained on curated data sets that don’t encompass all written text. The process of curating the data set is necessarily constraining, which means the data admitted so far must be much smaller than the data that could be admitted. These analyses also assume we’ve reached a fixed point in the algorithmic ability of these models to converge.<p>I think the truth is we’ve written all that ever needs to be written, and even if the universe becomes populated by AI LLM chatbots talking only to one another, they will be fine feasting on what we’ve left them as a legacy.
I recently contributed to the LLM “low background steel” problem. I have a blog where I share Wikipedia articles. I wanted to do one on a topic related to AI, but the article I chose was thin on details. So, instead I asked GPT-4 for the info. I edited the output, found scholarly citations for just about every sentence with Google, and then added it to the wiki page.<p>I’m a copywriter (I know, my days are numbered), so I didn’t just plop the generated text in unchanged, but it’s still primarily GPT’s content. I did a nice job, imho, and improved the article greatly. It’s been up for several days now, so I think it has a good chance of staying long term.<p>Still, it’s a funny situation. On the one hand, I did something I’ve done many times before: researched a topic and added to a wiki article. I always feel gratified when I contribute to Wikipedia; I’m adding to the sum of human knowledge in an incredibly direct way.<p>But the information I added this time will be used to train future LLMs and thus “poison the well” with generated content. The jury is still out on just how bad generated content will be for training new models, but I definitely feel slightly conflicted about whether I did something that is a net positive or negative.
I don't think it's so cut and dried. The article paints a somewhat simplistic picture of how the best-performing LLMs work. The unsupervised pre-trained networks are indeed data hungry, but the secondary supervised learning stages can get by with a far smaller set of highly curated prompt-response data, e.g. LIMA (<a href="https://arxiv.org/abs/2305.11206" rel="nofollow noreferrer">https://arxiv.org/abs/2305.11206</a>).<p>Another factor is that generated data distributed online may be quite high quality (because people find it interesting enough to share), so it's plausible this could actually improve model performance. Some LLMs have been trained on data from other models with good results, e.g. supposedly Bard with GPT-4 prompt-response pairs, and GPT-4 with Whisper transcripts of YouTube (<a href="https://twitter.com/amir/status/1641219919202361344/photo/1" rel="nofollow noreferrer">https://twitter.com/amir/status/1641219919202361344/photo/1</a>). Of course, there could be trolling or misinformation that "poisons" the data, and that is a problem (whether synthetic or organic)!
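To make the first point concrete, that secondary supervised stage can be as small as something like the sketch below; the base model, prompt formatting, and hyperparameters are illustrative guesses on my part, not LIMA's actual recipe:

    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    BASE = "gpt2"  # stand-in for whatever pre-trained base model you have
    tok = AutoTokenizer.from_pretrained(BASE)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(BASE)

    # A tiny, hand-curated prompt/response set; the point is it can be small.
    pairs = [
        {"prompt": "Explain overfitting in one sentence.",
         "response": "Overfitting is when a model memorizes its training data "
                     "instead of learning patterns that generalize."},
        # ...a few hundred to a few thousand more curated examples...
    ]

    def to_features(ex):
        text = ex["prompt"] + "\n" + ex["response"] + tok.eos_token
        enc = tok(text, truncation=True, max_length=256, padding="max_length")
        # Mask padding positions out of the loss.
        enc["labels"] = [t if m == 1 else -100
                         for t, m in zip(enc["input_ids"], enc["attention_mask"])]
        return enc

    ds = Dataset.from_list(pairs).map(to_features,
                                      remove_columns=["prompt", "response"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="sft-out", num_train_epochs=3,
                               per_device_train_batch_size=1),
        train_dataset=ds,
    ).train()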
Another problem with this take is that it's giving humans too much credit here -- there's a vast continuum from pure-human-creation to pure-AI-creation.<p>Human beings are given to
(1) repeating tropes
(2) arguing incoherently
(3) missing the point,
etc.<p>Think of all the books and movies from before 2023. Isn't there a lot that is formulaically wrong/misleading/suboptimal in there?<p>So this might not really be "poisoning the well" -- a more interesting area to look at would be: how can we make GPT-n aware of its own gaps and <i>use</i> that knowledge, rather than just letting the user know that it knows its gaps?
I'm doubtful about the effort people will take to protect their content from harvesting.<p>Creative content has been continuously devalued for decades now. The whole reason you can find so much music, creative writing, and art available for free online is that this content is essentially worthless until you can build a brand or clientele to monetize it, and the only way to do that is to broadcast it for free to as many people as possible.
This is a common but misunderstood concern. It’s one of those worries that seems sound in theory, but practice doesn’t bear it out. Remember when people were up in arms about SSD wear cycles? Yeah, that’s not actually how they fail.
There are real problems with AI. This is not one of them.