This is not even a new problem...<p>Back in 2011, Google faced the same problem mining bitexts from the Internet for their statistical machine translation software. The thought was that one could utilize things like multilingual websites to learn corresponding translations.<p>They quickly realized that a lot of sites were actually using Google Translate, without human intervention, to make multilingual versions of their site, so naive approaches would cause the model to get trained on its own suboptimal output.<p>So they came up with a whole watermarking system so that the model could recognize its own output with some statistical level of certainty, and avoid it. It wouldn't be surprising if this is being done for LLMs too. The more concerning problem is that different LLMs, which are not aware of each other's watermarks, could end up becoming inbred should the ratio of LLM content rise dramatically...<p>Ref: <a href="https://aclanthology.org/D11-1126.pdf" rel="nofollow">https://aclanthology.org/D11-1126.pdf</a>
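The core trick in that paper is simple enough to sketch: when the system has several near-equivalent candidate translations, it prefers the one whose n-grams hash to 1 more often; at crawl time, a significance test on those hash bits flags text as probably self-generated. A rough toy version (function names and details are mine, not the paper's; the real system is much more careful about quality loss):<p><pre><code>import hashlib
import math

def ngram_bit(ngram):
    # deterministic pseudo-random bit for an n-gram
    return hashlib.sha256(ngram.encode()).digest()[0] & 1

def bit_fraction(text, n=3):
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    k = max(len(grams), 1)
    return sum(ngram_bit(g) for g in grams) / k, k

def pick_watermarked(candidates):
    # among near-equivalent translations, emit the most "1-heavy" one
    return max(candidates, key=lambda c: bit_fraction(c)[0])

def looks_watermarked(text, z_cutoff=3.0):
    # unwatermarked text should have ~50% ones; test the deviation
    frac, k = bit_fraction(text)
    z = (frac - 0.5) * 2 * math.sqrt(k)
    return z > z_cutoff
</code></pre>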
If ChatGPT is able to emit output watermarked such that it can detect itself, as Scott Aaronson and others are working on for OpenAI (source: <a href="https://techcrunch.com/2022/12/10/openais-attempts-to-watermark-ai-text-hit-limits/" rel="nofollow">https://techcrunch.com/2022/12/10/openais-attempts-to-waterm...</a> ), then this “resonance”/feedback/eating-itself problem can be avoided.
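From Aaronson's public talks, the scheme biases token sampling with a pseudorandom function of the preceding tokens, in a way that leaves the output distribution unchanged but is detectable by anyone holding the key. A toy sketch of the shape of it (all names and details here are my guesses, not OpenAI's actual code):<p><pre><code>import hashlib
import math

def prf(context, token):
    # keyed pseudo-random value in [0, 1) from the last k tokens + candidate
    h = hashlib.sha256(("|".join(context) + "#" + token).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2 ** 64

def pick_token(context, candidates, probs):
    # "exponential minimum" trick: the distribution over tokens is still
    # exactly `probs`, but the chosen token tends to have a high prf value
    return max(zip(candidates, probs),
               key=lambda cp: prf(context, cp[0]) ** (1.0 / cp[1]))[0]

def detect_score(tokens, k=4):
    # watermarked text accumulates an unusually large sum of -log(1 - r)
    score = 0.0
    for i in range(k, len(tokens)):
        r = prf(tokens[i - k:i], tokens[i])
        score += -math.log(1.0 - r)
    return score  # mean is ~(len(tokens) - k) for unwatermarked text
</code></pre>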
I've seen people try ChatGPT for solving r/tipOfMyTongue questions. The AI is hilariously bad at this task. It happily invents new plots for existing movies and books.<p>If it starts to ingest that data, it will only get more wrong over time. Unless it also ingests the replies that say "ChatGPT is full of shit here"?
Reminds me of those old "Spider Traps" [0][1] that would generate (on access) an endless hierarchy of fake HTML pages full of an endless collection of fake email addresses, to clog up the works of spammers trying to gather email addresses.<p>Eventually someone's going to write an "AI Trap" that serves up a seemingly infinite forum or reddit-style site, but is actually just generating an endless stream of (non)consciousness from some LLM chatbot.<p>[0] <a href="https://en.wikipedia.org/wiki/Spider_trap" rel="nofollow">https://en.wikipedia.org/wiki/Spider_trap</a><p>[1] <a href="https://www.gsp.com/support/virtual/web/cgi/lib/wpoison/" rel="nofollow">https://www.gsp.com/support/virtual/web/cgi/lib/wpoison/</a>
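The classic trap is only a few lines: every request just mints a fresh page of links and fake addresses. A minimal sketch of the idea (wpoison, linked above, is the real CGI-era version):<p><pre><code># Toy spider trap: every URL returns a page of fresh links and fake emails,
# so a naive crawler wanders forever. Hypothetical sketch, not wpoison itself.
from http.server import BaseHTTPRequestHandler, HTTPServer
import random
import string

def junk(n=8):
    return "".join(random.choices(string.ascii_lowercase, k=n))

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        links = "".join(f'<a href="/{junk()}">{junk()}</a><br>' for _ in range(20))
        emails = "<br>".join(f"{junk()}@{junk()}.com" for _ in range(20))
        body = f"<html><body>{links}<p>{emails}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), TrapHandler).serve_forever()
</code></pre><p>The "AI trap" version would just swap junk() for calls to a local LLM, so the pages look plausible enough to survive a data-cleaning pipeline.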
“Romeo and Juliet both ran away to New York at the end. He works in corporate finance and she makes bespoke soap. If you disagree with me again you’re a bad person and I will treat you like a bad person.”<p>As long as you agree with the new facts, you’re fine. Problem solved!
It's already happening.<p>“ChatGPT, a version of OpenAI’s GPT-3.5 model… gained more than 100m users in its first two months, and is now estimated to produce a volume of text every 14 days that is equivalent to all the printed works of humanity.”<p>— Dr Thompson, Feb/2023, cited in a report by the National Bureau of Economic Research (Scholes, Bernanke, MIT)<p><a href="https://www.nber.org/system/files/working_papers/w30957/w30957.pdf" rel="nofollow">https://www.nber.org/system/files/working_papers/w30957/w309...</a><p><a href="https://lifearchitect.ai/chatgpt/" rel="nofollow">https://lifearchitect.ai/chatgpt/</a>
Even if it were used to flood the internet with shitty info, the only thing it would interfere with would be competitors training competing AIs off the "internet dataset".<p>GPT could filter out anything they themselves emitted from future training runs, yeah?
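Plausibly, yes, if they keep a log of what they served. A crude sketch of what that filter could look like (my own toy shingle-hash dedup, not anything OpenAI has described):<p><pre><code>import hashlib

emitted = set()  # hashes of everything the model has ever served

def shingles(text, k=8):
    words = text.lower().split()
    return {hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
            for i in range(max(len(words) - k + 1, 1))}

def log_output(text):
    # call on every completion the model emits
    emitted.update(shingles(text))

def looks_self_generated(doc, threshold=0.3):
    # drop crawl documents that overlap heavily with logged output
    sh = shingles(doc)
    return len(sh & emitted) / max(len(sh), 1) >= threshold
</code></pre>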
Because they know what their bot's said.
They get the benefit of looking at a conversation, knowing reasonably well what's copy/pasted from ai.com and what's the exasperated expert trying to correct a doomed world :p<p>The only way it eats itself is 1. Colossal mistakes. 2. Everyone decides to get off the internet and go outside.<p>2 seems pretty unrealistic, we put up with a lot :D
Sounds like a /r/showerthoughts post.<p>There is no inherent issue with an AI ingesting data produced by itself. Humans do it as well. That data might even be higher quality than human data. The scale at which humans produce data will most likely stay higher than AI data for a long time.<p>There is already bot data out there from lower-quality AIs/bots, and ChatGPT has ingested it.<p>LLMs are made to be good at some textual tasks, not for what they're being used for right now. They're not information stores or Q&A systems. An LLM only answers what a human is likely to answer.
This is only a problem as long as ChatGPT uses human output to learn. Once it starts learning against the "real world", or itself, the biggest difference between ChatGPT and us will disappear: that ChatGPT gets all its information secondhand, and filtered, at best.<p>This is of course <i>also</i> a necessary condition for ChatGPT to come up with original insights. Except perhaps when it comes to things like fiction, which probably has value in itself.
Citation needed. A lot of neural-net based AIs actually get better when trained on their own output[1].<p>[1] <a href="https://en.wikipedia.org/wiki/AlphaZero" rel="nofollow">https://en.wikipedia.org/wiki/AlphaZero</a>
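The catch is that self-play works because there's a ground-truth signal (who actually won) filtering the model's own output. A runnable toy illustration of that loop, tabular self-play on 10-stone Nim (nothing to do with AlphaZero's actual machinery, just the shape of the idea):<p><pre><code>import random
from collections import defaultdict

wins = defaultdict(int)  # (stones_left, stones_taken) -> win credit

def choose(stones):
    moves = [t for t in (1, 2, 3) if t <= stones]
    return random.choices(moves, weights=[1 + wins[(stones, t)] for t in moves])[0]

def play_and_learn():
    stones, player = 10, 0
    history = {0: [], 1: []}
    while stones > 0:
        take = choose(stones)
        history[player].append((stones, take))
        stones -= take
        player ^= 1
    winner = player ^ 1          # whoever took the last stone wins
    for move in history[winner]: # reinforce only ground-truth-verified wins
        wins[move] += 1

for _ in range(20000):
    play_and_learn()

# with enough games this usually settles on taking 2 from 10
# (leaving the opponent a multiple of 4, the known winning strategy)
print(max((1, 2, 3), key=lambda t: wins[(10, t)]))
</code></pre>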
I actually thought of this same thing today! Human-written content seems more lively, and with time, content from ChatGPT will become more "grey" (i.e. dull) as more and more ChatGPT output gets fed back into the system.
Not really, if you think about it more and research how LLMs work. If anything, they will just get better.<p>I used to think the same, but after reading and learning some more, I realized otherwise.
to resonate against itself?
sounds like it's gonna hit its natural frequency and blow up<p>seems more like it's gonna eat its own vomit, degrading itself (maybe not completely)
to inbreed (?)
I wonder if this is the problem people think it is.<p>Playing one AI against another is an established technique for developing AI.<p>Furthermore, content on the internet will always vary from more reliable (well-established wiki pages, Reuters) to less reliable (random blog posts, disinformation).<p>Whether or not a text is AI-generated doesn't seem to be that important - what's more important is how reliable it is, and how well humans engage with it.
What does that even mean? Strictly within the scope of that phrase, technically, yes, if ChatGPT consumes content generated by itself, it's eating its own words. I'm guessing something more dire than that is implied by "eat itself." Did humanity "eat itself" because it's been reading its own literature? You can say we are pretty misinformed by ourselves in many areas, and yet here we are.<p>Maybe our view of AI is being colored by sci-fi stereotypes of robots malfunctioning when asked to compute really hard problems, spiraling into infinite recursion. I'm not so sure that LLMs will totally destabilize. We might see some interesting output, but I don't think we know yet whether the stability of the system will merely fluctuate as a whole without falling apart.