The last generation of popular LLMs focuses on publicly accessible web text. But we have lots of other sources of "latent" or "hidden" text. For example, OpenAI's Whisper model can reliably turn audio into text. Point Whisper at the world's podcasts and you get a whole new source of conversational text; point it at YouTube and you get a whole new source of text of every kind. And then there are private sources of text, like UpToDate for doctors, LexisNexis for lawyers, and so forth. I suspect "running out" isn't a within-a-decade concern, especially since text or text-equivalent data grows exponentially in the present internet environment. I think the bigger challenge will be distinguishing human-generated from AI-generated data after 2023.
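As a concrete sketch of the Whisper route (the file name is hypothetical; the point is that any audio archive becomes trainable text once transcribed):

```python
# pip install openai-whisper
import whisper

# Hypothetical episode file; podcasts, lectures, and YouTube rips
# all become corpus text through the same few lines.
model = whisper.load_model("base")                # small, CPU-friendly checkpoint
result = model.transcribe("podcast_episode.mp3")  # speech -> text
print(result["text"])                             # plain text, ready for a corpus
```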
Brute force approaches always hit some wall, and ML will be no different. In the decades to come it is quite likely that algorithms will develop in directions orthogonal to current approaches. The idea that you improve performance by throwing gazillions of data points into gargantuan models might even come to be seen as laughable.

Keep in mind (pun intended) that the only real intelligence here is us, and we are pretty good at figuring out when a tool has exhausted its utility.
On this note, the dataset available if you start collecting today is tainted with experimental AI content. Not the biggest issue right now, but as time goes on this problem will get worse, and we'll be basing our simulations of intelligence on the output of our simulations of intelligence: a brave new abstraction.
We just are not thinking wide enough:

* Train on all of television history and streaming content.

* Train on YouTube.

* I suspect at some point we'll have a recording of most people's lives, e.g. live-streaming: https://en.wikipedia.org/wiki/Lifestreaming#Lifecasting
If we play our cards right, AI could free people up for more valuable pursuits, and the pace of human information production would increase by orders of magnitude.
I wonder if the better question is not how we get more training data but:

If we're running out of training data while hallucinations and performance remain so inadequate (per OpenAI's whitepaper), is an autoregressive transformer the right architecture?

Perhaps ongoing work in fine-tuning will take these models to the next level, but setting aside the LLM hype, it really does seem like things have plateaued for a while now (even with the expected gains from scaling).
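For readers not steeped in the jargon: "autoregressive" just means each new token is predicted from everything generated so far. A minimal sketch with GPT-2 via Hugging Face transformers (the model choice is purely illustrative, not what any frontier lab ships):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for any autoregressive LLM: every new token is
# chosen from a distribution conditioned on the entire prefix.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Scaling laws suggest", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(ids).logits            # forward pass over the whole prefix
    next_id = logits[0, -1].argmax()          # greedy choice of the next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```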
This take is a bit silly in that it implies the problem with training models will be that we run out of data. It's more likely that the problem is that current models require too much data to reach convergence.

We've been trying to speed-run neural network science for the past decade, but we still don't fully understand how these networks work. It's like being a bad programmer who doesn't understand algorithms and compensates by spending money on hardware to make the programs run faster. At some point we will reach a limit where you can't buy your way out of the problem with more data or money, and we'll all be forced to return to studying the foundations of the science rather than just trying to scale the existing models up.

I am certain that when we get to that point, everyone will realize we've been trying to feed these models too much data. It makes more sense that our current architectures are just not effective at assimilating the data they have.
This analysis misses the impact of AI models being deployed, as is happening rapidly right now. Production applications built on AI will provide ample (infinite?) additional training data to feed back into the underlying models.
I love how "running out of data" implies that AI companies have access to all the text we ever wrote on all platforms out there. I mean, it's probably true...
This is dumb. An individual human takes in more data than modern LLMs do.

https://open.substack.com/pub/echoesofid/p/why-llms-struggle-with-basics-too?utm_source=direct&utm_campaign=post&utm_medium=web
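A rough back-of-envelope makes the comparison concrete. Every number here is my own order-of-magnitude assumption, not a figure from the linked post:

```python
# Back-of-envelope only; all constants are rough assumptions.
optic_nerve_bps = 1e7                 # ~10 Mbit/s, a commonly cited estimate
seconds = 20 * 365 * 24 * 3600        # twenty years, in seconds (sleep ignored)
human_bytes = optic_nerve_bps * seconds / 8   # ~8e14 B, roughly 800 TB of raw input

llm_tokens = 1.5e13                   # ~15T tokens, frontier-scale pretraining
llm_bytes = llm_tokens * 4            # ~6e13 B, roughly 60 TB of text
print(human_bytes / llm_bytes)        # humans take in ~10x more raw data
```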
No.

AI can generate as much synthetic data as we need, on demand.

Many SOTA models, in fact, are already being trained with synthetic AI-generated data.

See https://en.wikipedia.org/wiki/Betteridge's_law_of_headlines
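A minimal sketch of the synthetic-data route (the prompt and model name are illustrative; assumes the openai Python client and an API key in the environment):

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt; the point is only that a strong model can emit
# fresh question/answer pairs for a smaller model to train on.
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "Write one question a curious student might ask about "
                   "photosynthesis, then answer it in two sentences.",
    }],
)
print(resp.choices[0].message.content)  # one synthetic training example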