The last generation of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can reliably turn audio into text. Point Whisper at the world's podcasts and you get a whole new source of conversational text; point it at YouTube and you get a whole new source of text of every kind. And then there are private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
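As a concrete sketch of the Whisper route (the file name is hypothetical; the point is that any audio archive becomes trainable text once transcribed):

```python
# pip install openai-whisper
import whisper

# Hypothetical episode file; podcasts, lectures, and YouTube rips
# all become corpus text through the same few lines.
model = whisper.load_model("base")                # small, CPU-friendly checkpoint
result = model.transcribe("podcast_episode.mp3")  # speech -> text
print(result["text"])                             # plain text, ready for a corpus
```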
Brute force approaches always hit some wall, and ML will be no different. In the decades to come it is quite likely that algorithms will develop in directions orthogonal to current approaches. The idea that you improve performance by throwing gazillions of data points into gargantuan models might even come to be seen as laughable.

Keep in mind (pun intended) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.
On this note, the dataset available if you start collecting today is tainted with experimental AI content. Not the biggest issue right now, but as time goes on this problem will get worse, and we'll be basing our simulations of intelligence on the output of our simulations of intelligence: a brave new abstraction.
We just are not thinking wide enough:

* Train on all of television history and streaming content.

* Train on YouTube.

* I suspect at some point we'll have a recording of most people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting
If we play our cards right, AI could free people up for more valuable pursuits, and the pace of human information production would increase by orders of magnitude.
I wonder if the better question is not how we get more training data but:

If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?

Perhaps ongoing work in fine-tuning will take these models to the next level, but setting aside the LLM hype, it really does seem like things have plateaued for a while now (even with the expected gains from scaling).
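For readers not steeped in the jargon: "autoregressive" just means each new token is predicted from everything generated so far. A minimal sketch with GPT-2 via Hugging Face transformers (the model choice is purely illustrative, not what any frontier lab ships):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for any autoregressive LLM: every new token is
# chosen from a distribution conditioned on the entire prefix.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Scaling laws suggest", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits            # forward pass over the whole prefix
    next_id = logits[0, -1].argmax()          # greedy choice of the next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```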
This take is a bit silly in that it implies the problem with training models will be that we run out of data. It's more likely that the problem is that current models require too much data to reach convergence.

We've been trying to speed-run neural network science for the past decade, but we still don't fully understand how these networks work. It's like being a bad programmer who doesn't understand algorithms and compensates by spending money on hardware to make the programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money, and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.

I am certain that when we get to that point, everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.
This analysis misses the impact of AI models being deployed, as is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.
I love how "running out of data" implies that AI companies have access to all the text we ever wrote on all platforms out there. I mean, it's probably true...
This is dumb. An individual human takes in more data than modern LLMs do.

https://open.substack.com/pub/echoesofid/p/why-llms-struggle-with-basics-too?utm_source=direct&utm_campaign=post&utm_medium=web
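A rough back-of-envelope makes the comparison concrete. Every number here is my own order-of-magnitude assumption, not a figure from the linked post:

```python
# Back-of-envelope only; all constants are rough assumptions.
optic_nerve_bps = 1e7                 # ~10 Mbit/s, a commonly cited estimate
seconds = 20 * 365 * 24 * 3600        # twenty years, in seconds (sleep ignored)
human_bytes = optic_nerve_bps * seconds / 8   # ~8e14 B, roughly 800 TB of raw input

llm_tokens = 1.5e13                   # ~15T tokens, frontier-scale pretraining
llm_bytes = llm_tokens * 4            # ~6e13 B, roughly 60 TB of text
print(human_bytes / llm_bytes)        # humans take in ~10x more raw data
```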
No.

AI can generate as much synthetic data as we need, on demand.

Many SOTA models, in fact, are already being trained with synthetic AI-generated data.

See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines
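A minimal sketch of the synthetic-data route (the prompt and model name are illustrative; assumes the openai Python client and an API key in the environment):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the point is only that a strong model can emit
# fresh question/answer pairs for a smaller model to train on.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Write one question a curious student might ask about "
                   "photosynthesis, then answer it in two sentences.",
    }],
)
print(resp.choices[0].message.content)  # one synthetic training example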