Funny that it does not take that much data to train your average 20th-century human genius. I'd say that if we are dreaming about the future of AI, learning and reasoning seem like the greater issues, not data. That said, the article title is about LLMs, so I guess that's what will need changing.
The paradox is that the amount of data available for LLM training is going down, not up, because earlier models made ample use of copyrighted works that later models won't have access to.
Lots of assumptions here. First, that we will only be training on text data; if we take into consideration all the video and audio being shared, I am quite sure we would have one or two orders of magnitude more data. Second, that it even matters: there has been some early research showing that training on the right data improves prediction more than training on more data (which intuitively makes sense; training on papers and books is much more useful than training on YouTube comments). Additionally, a lot of the improvement in quality comes from RLHF, which is basically manual human labeling. And last, my guess is that improvements in architecture are what will unlock the next level of performance, not just scaling.
The amount of data that all the different government agencies have tucked away in their file cabinets has to be orders of magnitude more than what's on the public internet. The amount of data in the military... I couldn't even fathom.

One data source I've been thinking about that I don't know if they've hit yet is all the different local and state agencies and their private and public meetings, ordinances, discourse, etc.
"If trends continue, language models will fully utilize this stock between 2026 and 2032" - that will require data centers with their own nuclear reactors (or other power plants) as hinted at by Marc Zuckerberg?
The argument about running out of data is kind of stupid.

We have billions of cameras, microphones, and IMU/GPS sensors. In fact, there's one in almost every pocket and on every desk.

Survival requires intelligence to be energy- and resource-efficient.

Those who build the most powerful and useful models that run locally on the edge and are data-efficient have a higher chance of winning.

Whoever provides the cheapest, fastest, most useful models will keep on winning.
We're not even close to running out of human-generated data. The reason it seems this way is that it's so hard to find old data. There are tons of whole magazine scans on some obscure website that's not even indexed. Most of this is the fault of Google, which has been an atrocious steward of search. Why is it that I still can't do full-text search of the Internet Archive dataset? Forever copyright of commercially de minimis works also plays a large role.

There's a monumental amount of quality data out there that's not indexed, not searchable, and abandoned and unused. We just need to value it enough to use it.
They haven't even translated non-English material and mixed it all in yet (that I know of).

This is big because it would hold novel data the West doesn't have access to. What is the 'mood' of the average Chinese farmer on Taiwan?

Otherwise it's hard to see how adding more text of the same thing is going to create a revolution.

Video will be something new. But if, like in "Her", it watches every Twitch stream simultaneously for a month, and talks to a billion people for a month, and still doesn't get it, what else is going to happen?