
Will we run out of ML data? Evidence from projecting dataset size trends (2022)

66 points by kurhan about 2 years ago

13 comments

pixelmonkey about 2 years ago
The last gen of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can turn audio to text reliably. If you point Whisper at the world's podcasts, that's a whole new source of conversational text. If you point Whisper at YouTube, that's a whole new source of all sorts of text. And then there are all sorts of private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
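A minimal sketch of the kind of speech-to-text pipeline described above, assuming the open-source openai-whisper package; the model size and file path are placeholders rather than a specific production setup:

    # pip install openai-whisper
    import whisper

    # Load a pretrained speech-to-text model; "base" is a placeholder size,
    # larger checkpoints trade speed for accuracy.
    model = whisper.load_model("base")

    # Transcribe one (hypothetical) podcast episode to plain text.
    result = model.transcribe("podcast_episode.mp3")

    # The transcript could then be cleaned and added to a text corpus.
    print(result["text"][:500])

In practice the same loop would run over large podcast or video archives, with the transcripts deduplicated and filtered before training.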
nologic01 about 2 years ago
Brute force approaches always hit some wall. ML will be no different. In the decades to come it is quite likely that algorithms will develop in directions orthogonal to current approaches. The idea that you improve performance by throwing gazillions of data points into gargantuan models might even come to be seen as laughable.
Keep in mind (pun intended) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.
StrangeATractor about 2 years ago
On this note, the data-set available if you start collecting today is tainted with experimental AI content. Not the biggest issue right now, but as time goes on this problem will get worse and we'll be basing our simulations of intelligence on the output of our simulations of intelligence, a brave new abstraction.
bhouston about 2 years ago
We just are not thinking wide enough:
* Train on all of television history, and streaming content.
* Train on YouTube.
* I suspect at some point we'll have a recording of most of people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting
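As a rough back-of-the-envelope illustration of how much text such audio and video sources might yield, here is a small sketch; the corpus size, speaking rate, and tokens-per-word figures are assumptions chosen for illustration, not measurements:

    # Back-of-envelope estimate of tokens recoverable from speech/video.
    # All constants are illustrative assumptions, not measured values.
    HOURS_OF_AUDIO = 1e8      # hypothetical corpus size, in hours
    WORDS_PER_MINUTE = 150    # typical conversational speaking rate
    TOKENS_PER_WORD = 1.3     # common rule of thumb for subword tokenizers

    tokens = HOURS_OF_AUDIO * 60 * WORDS_PER_MINUTE * TOKENS_PER_WORD
    print(f"~{tokens:.2e} tokens")  # ~1.17e+12 tokens under these assumptions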
replygirl about 2 years ago
If we play our cards right, AI could free people up for more valuable pursuits, and the pace of human information production would increase by orders of magnitude
haldujai about 2 years ago
I wonder if the better question is not how we get more training data but:
If we're running out of training data, with hallucinations and performance remaining so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?
Perhaps ongoing work in finetuning will take these models to the next level, but ignoring the LLM hype it really does seem like things have plateaued for a while now (with expected gains from scaling).
HybridCurve about 2 years ago
This take is a bit silly in that it implies the problem with training models will be that we run out of data. It's more likely that the problem is that current models require too much data to reach convergence.
We've been trying to speed-run neural network science for the past decade, but we still don't fully understand how these models work. It's like being a bad programmer who doesn't understand algorithms, so you compensate by spending money on hardware to make your programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money, and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.
I am certain that when we get to that point, everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.
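One way to make "requires too much data" concrete is the compute-optimal rule of thumb from the Chinchilla scaling work, roughly 20 training tokens per model parameter. The sketch below applies that approximation to a few illustrative model sizes; both the constant and the sizes are simplifications, not claims about any particular model:

    # Rough data requirements under the Chinchilla rule of thumb
    # (~20 training tokens per parameter). Model sizes are illustrative.
    TOKENS_PER_PARAM = 20

    for params in (7e9, 70e9, 500e9):
        tokens_needed = params * TOKENS_PER_PARAM
        print(f"{params / 1e9:>5.0f}B params -> ~{tokens_needed / 1e12:.1f}T tokens")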
bobsmooth about 2 years ago
There's gotta be entire libraries that haven't been digitized that can be mined for data.
brianr about 2 years ago
This analysis misses the impact of AI models being deployed, as is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.
laserbeam about 2 years ago
I love how "running out of data" implies that AI companies have access to all the text we ever wrote on all platforms out there. I mean, it's probably true...
flyval about 2 years ago
This is dumb. An individual human takes in more data than modern LLMs do.
https://open.substack.com/pub/echoesofid/p/why-llms-struggle-with-basics-too?utm_source=direct&utm_campaign=post&utm_medium=web
MagicMoonlight about 2 years ago
Only if you rely on dumb learning, where the model is doing pure pattern matching rather than interacting and learning by reinforcement based on the responses.
cs702 about 2 years ago
No.
AI can generate as much synthetic data as we need, on demand.
Many SOTA models, in fact, are already being trained with synthetic AI-generated data.
See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines
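A minimal sketch of what synthetic data generation can look like in practice, assuming the openai Python client; the model name, prompt, and output format are placeholders rather than any specific lab's pipeline:

    # pip install openai
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Ask an existing model to produce new question/answer pairs that could be
    # filtered and mixed into a training set as synthetic examples.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Write 3 short, factual Q&A pairs about basic physics, "
                       "formatted as 'Q: ... A: ...'",
        }],
    )
    print(response.choices[0].message.content)

In practice, generated examples like these are usually filtered for quality and deduplicated before being mixed into a training set.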