They'd better not have removed the good stuff, like the full text of the subreddit dedicated to counting to a million, the logs of hashed numbers from various cryptos, and the tables of datamined stats from seemingly every console game.
Do they mention anywhere the definition of "low quality" data, or the proportion of removed data that was low quality versus duplicate?

They mention: "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case just means explicitly duplicating the training data, so the only potential gains would come from removing the low-quality data?
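For what it's worth, here's a minimal sketch of what "upsampling" could mean under that reading: repeating (or resampling) the deduplicated documents until a target token budget is reached. This is not the SlimPajama authors' actual pipeline, just an illustration of the interpretation above; the function names, toy corpus, and token counter are all made up.

```python
# Hypothetical sketch: "upsampling" a deduplicated corpus back to a fixed
# token budget by duplicating documents. Not the SlimPajama pipeline.
import itertools
import random


def upsample_by_duplication(docs, token_budget, count_tokens):
    """Cycle through the deduplicated docs, repeating them in order
    until the requested token budget is reached."""
    sampled, total = [], 0
    for doc in itertools.cycle(docs):
        tokens = count_tokens(doc)
        if total + tokens > token_budget:
            break
        sampled.append(doc)
        total += tokens
    return sampled, total


def upsample_by_weighted_sampling(docs, token_budget, count_tokens, seed=0):
    """Alternative reading: sample documents with replacement, which gives
    the same expected duplication without a fixed repetition order."""
    rng = random.Random(seed)
    sampled, total = [], 0
    while total < token_budget:
        doc = rng.choice(docs)
        sampled.append(doc)
        total += count_tokens(doc)
    return sampled, total


if __name__ == "__main__":
    corpus = ["doc one text", "another document", "third unique doc"]
    naive_count = lambda d: len(d.split())  # stand-in for a real tokenizer
    dup, n = upsample_by_duplication(corpus, token_budget=20, count_tokens=naive_count)
    print(f"duplicated corpus: {len(dup)} docs, {n} tokens")
```

Either way, the upsampled corpus contains no new information beyond the deduplicated one, which is why any quality gains would have to come from the low-quality filtering rather than the duplication itself.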
I'm interested in seeing this scaled up to larger models (30B+ parameters), and the dataset expanded with more high-quality data (scientific papers, more books, more code, etc.).