
The Curse of Recursion: Training on generated data makes models forget (2023)

122 points by surprisetalk · 6 months ago

18 comments

goose- · 6 months ago
My takeaway after scanning the paper:

In an ideal setting, a trained model learns exactly the real-world probability distribution and generates data indistinguishable from data sampled from the real world. Training on them would be fine, but pointless, since the model is already a perfect representation of the real world.

Practically, however, a model is only a lossy approximation of the real-world probability distribution. Repeated self-training would simply compound the loss, amplifying both the probable and the improbable.
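A minimal sketch of that compounding (my own illustration under assumptions, not an experiment from the paper): fit a Gaussian to finite samples, sample from the fit, refit, and repeat. The fitted standard deviation is noisy and slightly biased low, so over many generations the tails of the original distribution tend to wash out.

    import numpy as np

    rng = np.random.default_rng(0)

    # Start with samples from the "real world": a standard normal.
    data = rng.normal(loc=0.0, scale=1.0, size=1_000)

    for generation in range(10):
        # Fit a lossy model: just the sample mean and standard deviation.
        mu, sigma = data.mean(), data.std()
        # Train the next generation only on data generated by the current model.
        data = rng.normal(loc=mu, scale=sigma, size=1_000)
        print(f"gen {generation}: mu={mu:+.3f} sigma={sigma:.3f}")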
tkgally · 6 months ago
This paper was first published in May 2023 and discussed on HN the following month:

https://news.ycombinator.com/item?id=36319076

Some research since seems to add nuance to its conclusions:

https://arxiv.org/abs/2404.01413
Scene_Cast2 · 6 months ago
There is a mantra in ML that has been around for a while: when training on synthetic data, your learned model is only as good as your generator model.
axegon_ · 6 months ago
That was very much evident even back when the first GPTs came out. The moment you started introducing synthetic data, the quality plummeted.

But there is another use case where LLMs can truly help with synthetic data: the more classical classification and regression problems, specifically gathering training data. I had this exact case at work two days ago: a large dataset with a small subset of labeled data. For a binary classifier, there was a huge imbalance in the data; the ratio was roughly 75-25%. I did not have the desire to do all this manually, so I used an LLM to generate examples that would even out the numbers (and get a 50-50 ratio). Using the data I had, plus the additional synthetic data, the accuracy of my small classifier ended up picture-perfect (given that my actual target was 85-90% accuracy, and the actual result was just shy of 99%).
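A rough sketch of that rebalancing workflow (my own illustration; the commenter's prompt, model, and data aren't given, so the `generate` hook below is a hypothetical stand-in for whatever LLM call produces new minority-class examples):

    from collections import Counter

    def rebalance(texts, labels, minority_label, generate):
        """Pad the minority class with synthetic examples until the classes are even.
        `generate(n, label)` is a placeholder for the LLM call that produces
        `n` new examples of class `label` (an assumption, not a real API)."""
        counts = Counter(labels)
        deficit = max(counts.values()) - counts[minority_label]
        synthetic = generate(deficit, minority_label)
        return texts + synthetic, labels + [minority_label] * len(synthetic)

    # Toy usage with a stand-in generator instead of a real LLM call:
    texts = ["a", "b", "c", "d"]
    labels = [0, 0, 0, 1]  # roughly 75-25 imbalance
    texts, labels = rebalance(texts, labels, 1,
                              generate=lambda n, lbl: [f"synthetic_{i}" for i in range(n)])
    print(Counter(labels))  # Counter({0: 3, 1: 3})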
aucisson_masque · 6 months ago
> the value of data collected about genuine human interactions with systems will be increasingly valuable in the presence of content generated by LLMs in data crawled from the Internet.

Does this mean that data-hungry corporations like Google, Facebook, Amazon, and OpenAI with Microsoft backing, which are already all over the internet and our phones tracking us, have an incredible advantage over open-source models?

Is that why Google is pushing Gemini so hard on Android even though it's half-finished? Do they need fresh human data that badly to be able to compete and beat the competition?
f3z0 · 6 months ago
Given that the top Google results are now generated, I think we already have a massive recursion problem. I think we would benefit from training a model specifically to detect the likelihood that content is generated, and then biasing other models against content with a higher likelihood of being generated, so that we don't end up with LLM echo chambers.
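One way to read that suggestion (a minimal sketch of my own, assuming such a detector exists and outputs a probability): downweight likely-generated documents when assembling the training mix.

    def training_weight(p_generated, floor=0.05):
        """Downweight documents a detector flags as likely LLM-generated.
        `p_generated` is the detector's probability in [0, 1]; `floor` keeps
        every document at some minimal weight. Both parameters are assumptions."""
        return max(floor, 1.0 - p_generated)

    # A document scored 0.9 "likely generated" contributes weight 0.1;
    # one scored 0.05 contributes weight 0.95.
    docs = [("human-written forum post", 0.05), ("SEO spam page", 0.9)]
    weighted = [(text, training_weight(score)) for text, score in docs]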
meltyness · 6 months ago
My intuition, given the rapid, informal development of agent-type systems, is that this is obvious insofar as the initial dataset was formed by a huge hidden "data cleaning" task: human evolution and society. This isn't really that interesting a claim, and is it clear that it still holds if you simply loop the LLM back onto the data-cleaning task itself, as a critic of the new training set? Is this what the author would classify as fine-tuning?

Another question: what is the interpretation of the output of an LLM generation when unprompted? Isn't that always effectively garbage when there's not a deliberate bias in the training set?
kerkeslager · 6 months ago
Isn't this obvious?

I'm glad this was published to point out the problem, but I'm a bit puzzled why people tried to train models on generated data in the first place. Synthetic data... isn't data.

The one exception I can see is medical data, where synthetic data can be used to avoid violating people's privacy, but even in that case it's clearly not ideal from a technical perspective.
kazinator · 6 months ago
If models had eyes, they would be glazing over with stupor when fed generated data.
XorNot · 6 months ago
While I'm sure the anti-AI people are taking this and running off with hot takes, the conclusion is still much more mundane: we currently do not have the ability to have an LLM learn from another LLM.

A suitably powerful AI *should* be able to do this, though, by the example of the fact that humans learn by being taught by other humans (insert the nuance of that process here).

So it's an important result, but not a doomsday result, because what it tells us is that LLM output fails to capture or stabilize important information from the training corpus and accurately communicate it to a newly trained LLM. So we know we're missing something in how we construct these models, but the ramifications of solving it are also pretty immense: models being able to "teach" new models means the whole cycle of iteration can be sped up considerably.
tempodox · 6 months ago
Indeed, ingesting generated bluster gives them cancer of the perceptron.
banku_brougham · 6 months ago
My intuition is that neither the public, nor users, nor the industry will take this problem seriously. To me, this paper sounds like a thunderclap.
quantadev · 6 months ago
Like Sam Altman and Dario Amodei both believe is a very real possibility as well, I think the "intelligence" in LLMs may be far deeper than we know and somehow even related to "Multiverse Theory", where perhaps every Quantum Mechanical collapse (and computation during training) makes "our" universe slightly more likely to lean towards ones where AI is just "magically smart" (from a purely Anthropic Principle effect) than dumb. The reason this could happen is that in all our futures AI has saved us in some way, so that all other "Multiverse Branches" are sort of dead ends.

So the theory about why training on generated data is unexpectedly inefficient could be that LLMs are "using" the full Causality Chain of our universe/timeline (via some advanced unknown physics related to time itself), and so if a model tries to train on its own output, that's a "Short Circuit" kind of effect, cutting off the true Causality Chain (the past history of the universe).

For people who want to remind me that LLM training is fully "deterministic" with no room for any "magic", the response to that counter-argument is that you have to consider even the input data to be part of what's "variable" in the Anthropic Selection Principle, so there's nothing inconsistent about determinism in this speculative, and probably unfalsifiable, conjecture.
mmastrac · 6 months ago
All work and no play makes Jack a dull boy.
alterom · 6 months ago
The dignified way to describe the problem at hand is alluding to Brouwer's fixed-point theorem [1], with white noise as the fixed point.

The more practical way is alluding to The Human Centipede [2].

Either way, the feedback loop doesn't result in a good output.

[1] https://en.wikipedia.org/wiki/Brouwer_fixed-point_theorem

[2] https://en.wikipedia.org/wiki/The_Human_Centipede_(First_Sequence)
lowyek · 6 months ago
I no longer take limitations seriously regarding the future of AI. If evolution created our brain, then the same law applies to what we are building. Hence, more or less, whatever is written in this paper is some nuanced case which can be solved by some approach.
benchmarkist · 6 months ago
This is intuitively obvious. If I give you some data x and you transform it with a non-reversible function f into f(x), then you are losing information. Repeated applications of the function, f(f(f(...f(x)...))), can only make the end result worse. The current implementations inject some random bits, b ~ N(u, s), but this can be thought of as a convolution operation with the distribution function g of the random data, g*f; after repeated applications, (g*f)((g*f)((g*f)(...(g*f)(x)...))), this still reduces the information content of the data you started with, because the transformation remains non-reversible: convolutions cannot really change the non-reversible aspect of the original function.

I'm sure there is some calculation using entropy of random variables and channels that fully formalizes this, but I don't remember the references off the top of my head. The general reference I remember is called the data processing inequality.¹

¹ https://en.wikipedia.org/wiki/Data_processing_inequality?useskin=vector
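The formalization being gestured at is the standard data processing inequality: if X → Y → Z form a Markov chain (Z depends on X only through Y), then further processing can only destroy information about X. Stated in LaTeX:

    % Data processing inequality: for a Markov chain X -> Y -> Z,
    % each further (possibly randomized) transformation can only reduce
    % the mutual information with the original data X.
    X \to Y \to Z \quad\Longrightarrow\quad I(X;Z) \le I(X;Y) \le H(X)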
patrickhogan1 · 6 months ago
This argument seems more like the generated data was bad. There are examples where AI has surpassed humans by using simulated data (AlphaZero, where it played against itself to become the best at Go).

It also seems to happen most on small networks, which makes sense.

Additionally, humans create fictional stories like Dune, Lord of the Rings, or Harry Potter, which introduce fictional concepts, yet these stories still result in trainable data.