
The Curse of Recursion: Training on Generated Data Makes Models Forget

170 points | by indus | almost 2 years ago

22 comments

johnhamlin · almost 2 years ago
Ted Chiang predicted this in The New Yorker [1] in February in an article that shaped my thinking about what LLMs are capable of achieving in the near future. Chiang compared the summaries LLMs synthesize to a lossy compression algorithm for the internet.

"There is very little information available about OpenAI's forthcoming successor to ChatGPT, GPT-4. But I'm going to make a prediction: when assembling the vast amount of text used to train GPT-4, the people at OpenAI will have made every effort to exclude material generated by ChatGPT or any other large language model. If this turns out to be the case, it will serve as unintentional confirmation that the analogy between large language models and lossy compression is useful. Repeatedly resaving a jpeg creates more compression artifacts, because more information is lost every time. It's the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.

Indeed, a useful criterion for gauging a large language model's quality might be the willingness of a company to use the text that it generates as training material for a new model. If the output of ChatGPT isn't good enough for GPT-4, we might take that as an indicator that it's not good enough for us, either. Conversely, if a model starts generating text so good that it can be used to train new models, then that should give us confidence in the quality of that text. (I suspect that such an outcome would require a major breakthrough in the techniques used to build these models.) If and when we start seeing models producing output that's as good as their input, then the analogy of lossy compression will no longer be applicable."

[1] https://www.newyorker.com/tech/annals-of-technology/chatgpt-is-a-blurry-jpeg-of-the-web
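Chiang's re-saved JPEG analogy is easy to try directly. A minimal sketch, assuming Pillow is installed and some local image named `input.png`; the quality setting and generation count are arbitrary illustration values:

```python
# A rough illustration of "photocopies of photocopies": re-encode the same
# image as JPEG many times and measure how far it drifts from the original.
import io
from PIL import Image, ImageChops

original = Image.open("input.png").convert("RGB")
current = original

for generation in range(1, 21):
    buf = io.BytesIO()
    current.save(buf, format="JPEG", quality=75)  # the lossy step, repeated
    buf.seek(0)
    current = Image.open(buf).convert("RGB")

    # Mean absolute pixel difference from the very first image.
    diff = ImageChops.difference(original, current)
    drift = sum(diff.convert("L").getdata()) / (diff.width * diff.height)
    print(f"generation {generation:2d}: mean drift from original = {drift:.3f}")
```

The point of the analogy is only that each lossy encode/decode cycle can remove information but never restore it.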
semiquaver · almost 2 years ago
Wouldn't it be funny to find that the capabilities of LLMs have already peaked because we are unable to restrain ourselves from polluting the internet and other training corpus sources with their output?
StrangeATractor · almost 2 years ago
Hah, I brought this up here a few months ago and was quickly dismissed.

I wonder if opening GPT and DALLE to the public was partly intended to pollute subsequent data for anyone that gets into AI down the road. Suddenly a lot of publicly accessible data is worth less, leaving only players who've got a hoard of time-stamped data to compete with (like Google, Facebook). OpenAI almost certainly has the hashes of what it spits out too, so they'll be able to sort the wheat from the chaff for a while yet.

The market for data may be getting interesting.
rossdavidh · almost 2 years ago
I don't believe in the "dead internet theory" as a description of the current situation (mostly), but as a prediction? Maybe.

https://en.wikipedia.org/wiki/Dead_Internet_theory
gmartinsribeiro · almost 2 years ago
This is not a model problem or a synthetic data problem. This is common data science, and the article says as much: "We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear." Data quality is more important than data volume, and if you forget about that... garbage in, garbage out.

Make sure you have a representative training dataset; real or synthetic, it doesn't matter.
sebzim4500 · almost 2 years ago
There is a massive flaw in this argument. In real life, whether a given generated work ends up in a future dataset depends on how good it is according to humans. For example, in order for an article to end up in the reddit set it needs at least three upvotes.

They could have replicated it here by having GPT-4 score the samples and throwing out most (but not all) of the bad ones. I have no idea what would happen if you e.g. throw out most of the bottom 70% and keep the top 30%. It's conceivable to me that it would end up improving, or at least not getting much worse, with each generation.
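A minimal sketch of the filtering step described above, keeping only the top-scoring slice of generated samples before they re-enter the training pool; `quality_score` here is a hypothetical stand-in for a GPT-4 judge or human votes, not anything from the paper:

```python
import random

def quality_score(sample: str) -> float:
    """Hypothetical judge: in practice this would be upvotes, human ratings,
    or a stronger model prompted to grade the sample."""
    return random.random()  # placeholder score in [0, 1)

def filter_generations(samples: list[str], keep_fraction: float = 0.3) -> list[str]:
    """Keep roughly the top `keep_fraction` of generated samples by score,
    mimicking 'throw out most of the bad ones' before the next training round."""
    scored = sorted(samples, key=quality_score, reverse=True)
    cutoff = max(1, int(len(scored) * keep_fraction))
    return scored[:cutoff]

# next_training_set = real_data + filter_generations(model_outputs)
```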
visarga · almost 2 years ago
Generated data tends to be selected and edited by humans, so it is already a bit better than raw. In general a model that takes external feedback into account will be able to self-improve; for example, a code model would run tests and a chat model could interpret user responses as reward signals.

You gotta add something new to the mix. That's possible when the AI is part of a larger system. AlphaZero demonstrated that even self play could be a source of signal: as long as it gets the feedback of the game, the model can learn.
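As a concrete (if toy) version of the code-model example: a generated solution only survives into the next dataset if it actually passes its tests, so the signal comes from execution rather than from the model itself. Everything below is illustrative, not from the article:

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run the candidate plus its tests in a subprocess; pass/fail is an
    external signal the generating model cannot fake."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
        return result.returncode == 0
    finally:
        os.unlink(path)

# Two toy "model generations" for the same task; only the correct one survives.
candidates = [
    "def add(a, b):\n    return a - b\n",  # buggy generation
    "def add(a, b):\n    return a + b\n",  # correct generation
]
tests = "assert add(2, 3) == 5"
survivors = [c for c in candidates if passes_tests(c, tests)]
print(f"kept {len(survivors)} of {len(candidates)} generations for retraining")
```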
kromem · almost 2 years ago
I think one of the things overlooked in the discussions here is that the research is solely around the reinforcement against edge cases, but does not qualitatively assess these edge cases.

To me, this research supports a hypothesis I've had for a while that we're going to get to truly excellent AI by using synthetic data to bias it towards excellence and away from mediocrity.

$20 says the next round of major model training is using synthetic data for training that was filtered through a discriminator trained entirely on human data.

The human data as a reference is certainly important to avoid polluting (and to its point there's an advantage for those already having it), but moving away from edge cases isn't necessarily a bad thing practically, given edge cases can result in negative practical performance (as opposed to academic next-token performance).
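One possible reading of the discriminator idea, sketched with scikit-learn on toy data: train a classifier to separate human from model text, then keep only the synthetic samples it scores as plausibly human. Note the comment envisions a filter trained entirely on human data; this toy version uses both classes because a binary classifier needs negatives. The texts and threshold are placeholders:

```python
# Sketch: gate synthetic training data behind a human-vs-model discriminator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

human_texts = ["the quarterly report shows modest growth", "rain delayed the launch again"]
synthetic_texts = ["growth growth report modest shows", "the launch was delayed by rain today"]

discriminator = make_pipeline(TfidfVectorizer(), LogisticRegression())
discriminator.fit(human_texts + synthetic_texts,
                  [1] * len(human_texts) + [0] * len(synthetic_texts))

def keep_for_training(candidates, threshold=0.5):
    """Keep synthetic candidates the discriminator scores as plausibly human."""
    probs = discriminator.predict_proba(candidates)[:, 1]  # P(human-like)
    return [c for c, p in zip(candidates, probs) if p >= threshold]

print(keep_for_training(["rain delayed the quarterly launch", "report report report"]))
```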
tartakovsky · almost 2 years ago
Same idea here? Larger models do a better job forgetting their training data and dropping their semantic priors. Perhaps another way of thinking through this is that larger models learn new information and drop old information faster. https://arxiv.org/abs/2303.03846

Isn't that interesting? The idea of "mental liquidity", or "strong opinions weakly held"? https://news.ycombinator.com/item?id=36280772
winddude · almost 2 years ago
> For the private sector, many homeowners and corporations have longer-term fixed debt, and only some portion of it matures each quarter and gets refinanced at higher rates. As more private debt matures and gets refinanced at higher rates, this will continue to serve as a disinflationary and recessionary force on the economy, especially for sectors that are more sensitive to interest rates.

The one thing I don't get, and could have been missing in the past... a lot of corporations and private operations, like farms, operate on debt. Now maybe it's a bit reductionist, but if you're a farmer operating on debt and interest rates go up, you need to increase prices to cover operating expenses. And this gets compounded all the way up to the end consumer, as every step in the supply chain marks up by a fixed percent, and because everything is getting more expensive decides to mark up by a larger percent. So higher interest rates really could be contributing to inflation, and it's just creating a cycle. And with the current levels of debt never seen before in history, it's unlike other periods.
indus · almost 2 years ago
Side effect: search engine content would be the first to suffer if generated content goes undetected.
guy98238710 · almost 2 years ago
Recursive training of generative models degenerates into reinforcement learning with random feedback. You need a strong feedback signal in the loop to survive recursive training. The feedback does not have to come from humans, though. You can use anything that grounds the model in reality.
blovescoffee · almost 2 years ago
Self play in RL is signal enough that machines can learn on their own. How we train models and what class of models is important. No doubt the paper makes good points, but I don't think the reality is so black-and-white.
fhood · almost 2 years ago
I am in way over my head here, so I wasn't able to tell if the authors addressed this, but my intuition is that this should be somewhat mitigated so long as people are providing the filter between which results are discarded and which might end up back in the training pool.

I would think that the human selection process would help to head off this conversion, both by selecting against incorrect results and also by introducing variance outside of the model. On the other hand, since a person can only act as a filter, I can also see how that would be of limited value long term.
facu17y · almost 2 years ago
As long as the synthetic data is good, how can you tell the difference between it and human-generated data?

This paper has one huge hole in it: it assumes that content on the internet is not moderated and that the training dataset will never evolve to take rating into consideration. On social media, the form of moderation is # of likes. Once detected, bots that output bad data will be banned and their content deleted.

The key issue I have with the paper is that for good synthetic data it is impossible to tell it apart from human-generated data.
amuresan · almost 2 years ago
Not surprising. It always seemed likely to me that there is model bias if you train your models on model-generated data, like a feedback loop (second-order effects?). Similar to how applying a linear system over and over stretches the inputs in the direction of its largest eigenvector.

Now wait till the generated content is indistinguishable from human content (to humans) and it will be hard to figure out what's in your training set.
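The linear-system analogy is easy to see numerically: apply a fixed matrix to almost any starting vector enough times and the direction collapses onto the dominant eigenvector. This is just power iteration; the matrix and starting vector below are arbitrary examples:

```python
# Power-iteration illustration: repeated application of a fixed linear map
# collapses almost any input onto the direction of the largest eigenvector.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # arbitrary example matrix
v = np.array([1.0, -0.5])         # arbitrary starting vector

for step in range(12):
    v = A @ v
    v = v / np.linalg.norm(v)     # normalize so only the direction matters
    print(f"step {step + 1:2d}: direction = {np.round(v, 4)}")

# Compare with the true dominant eigenvector.
eigvals, eigvecs = np.linalg.eigh(A)
print("dominant eigenvector:", np.round(eigvecs[:, np.argmax(eigvals)], 4))
```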
_lpa_ · almost 2 years ago
LLM Kessler Syndrome.
hiAndrewQuinn · almost 2 years ago
I often think this about Anki and spaced repetition. At the limiting case it has to be overwriting other memories, right?
fredgrott · almost 2 years ago
Or, in short: we have an upcoming AI collapse as AI output bleeds into the internet space where AIs collect their inputs in the first place.

Actually, not scary, as it forces everyone to look for solutions, both the in-box kind and the out-of-box kind.
golol · almost 2 years ago
Just weight your training data by the engagement it has obtained online. High-quality output will have more engagement.
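A minimal sketch of what engagement-weighted sampling of a training batch could look like; the field names and the log scaling are illustrative choices, not something from the paper:

```python
# Sketch: sample training documents in proportion to (log-scaled) engagement,
# so widely-upvoted content is seen more often than ignored content.
import math
import random

documents = [
    {"text": "well-received human post", "engagement": 480},
    {"text": "typical post",             "engagement": 12},
    {"text": "ignored bot output",       "engagement": 0},
]

# Log scaling keeps viral posts from completely dominating the sample.
weights = [math.log1p(d["engagement"]) for d in documents]
batch = random.choices(documents, weights=weights, k=10)
print([d["text"] for d in batch])
```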
textninja · almost 2 years ago
I wonder if human learning will be similarly impaired.
antonkar · almost 2 years ago
I asked Bing's GPT-4 to summarize the abstract and then to summarize each resulting summary in turn, 11 times. I started a new chat after each summary and prepended "Summarize the following text:" each time. I had to use Creative mode, otherwise Bing searched instead of summarizing.

1st summary: The text discusses the impact of large language models (LLMs) on online text and image generation. It argues that LLMs can cause model collapse, a phenomenon where the original diversity of data is lost due to feedback loops between models and data sources. It shows that model collapse can affect various types of generative models and warns that it poses a threat to the quality and reliability of online content. It suggests that data from real human interactions will become more valuable to prevent model collapse.

2nd: The text warns that large language models (LLMs) can cause online content to lose its original diversity and become less trustworthy. It explains that LLMs can create feedback loops that reduce the variety of data they learn from. It suggests that real human data will be more important to avoid this problem.

3rd: The text argues that LLMs can make online content more homogeneous and unreliable by creating feedback loops that limit their data sources. It recommends using real human data to prevent this issue.

4th: The text warns that LLMs can produce biased and repetitive online content by relying on narrow data sets. It suggests using diverse and authentic human data to avoid this problem.

5th: The text advises using varied and real human data to prevent LLMs from creating online content that is unfair and redundant due to limited data sources.

6th: The text suggests using diverse and authentic data to avoid biased and repetitive online content by LLMs.

7th: The text argues that LLMs need better data to produce more varied and realistic online content.

8th: The text claims that language and literature models (LLMs) require more diverse and accurate data to generate online content that reflects the real world.

9th: The text argues that LLMs need better data to create realistic online content.

10th (here Bing started to hallucinate; I double-checked that it didn't search and the chat was new): The text claims that online content creators who use language models (LLMs) should have access to more reliable and diverse data sources to produce realistic and accurate content. It warns that LLMs can generate misleading or harmful content if they are trained on biased or low-quality data. It also suggests some ways to improve the data quality and availability for LLMs, such as using data verification tools, crowdsourcing data collection, and collaborating with other content creators.

11th: The text argues that online content creators need better and more diverse data sources to use language models (LLMs) effectively. It cautions that LLMs can produce harmful or inaccurate content if they rely on flawed data. It also recommends some methods to enhance the data quality and access for LLMs, such as verifying data, crowdsourcing data, and cooperating with other creators.
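This kind of experiment is straightforward to script against any chat-completion API. A rough sketch, with `call_llm` left as a stand-in so no particular vendor API is assumed; each round starts from a fresh prompt, mirroring the new-chat-per-summary setup above:

```python
# Sketch of the recursive-summary experiment: feed each summary back in as the
# next input, with no shared chat history between iterations.

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call (no vendor API assumed).
    It just echoes a shortened version of the input so the loop runs."""
    text = prompt.split(":", 1)[-1].strip()
    words = text.split()
    return " ".join(words[: max(8, int(len(words) * 0.6))])

def recursive_summaries(abstract: str, rounds: int = 11) -> list[str]:
    """Summarize, then summarize the summary, starting fresh each round."""
    summaries = []
    text = abstract
    for i in range(rounds):
        text = call_llm(f"Summarize the following text:\n\n{text}")
        summaries.append(text)
        print(f"summary {i + 1:2d}: {len(text.split())} words")
    return summaries

if __name__ == "__main__":
    recursive_summaries("Large language models trained on model-generated data "
                        "lose the tails of the original distribution over time.")
```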