They'd better not have removed the good stuff, like the full text of the subreddit dedicated to counting to a million, the logs of hashed numbers from various cryptos, and the tables of datamined stats from seemingly every console game.
Do they mention anywhere the definition of "low quality" data, or the proportion of removed data that was low quality versus duplicate?

They mention: "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case just means explicitly duplicating the training data, so the only potential gains would come from removing the low-quality data?
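For what it's worth, here's a minimal sketch of what "upsampling" could mean under that reading: repeating (or resampling) the deduplicated documents until a target token budget is reached. This is not the SlimPajama authors' actual pipeline, just an illustration of the interpretation above; the function names, toy corpus, and token counter are all made up.

```python
# Hypothetical sketch: "upsampling" a deduplicated corpus back to a fixed
# token budget by duplicating documents. Not the SlimPajama pipeline.
import itertools
import random


def upsample_by_duplication(docs, token_budget, count_tokens):
    """Cycle through the deduplicated docs, repeating them in order
    until the requested token budget is reached."""
    sampled, total = [], 0
    for doc in itertools.cycle(docs):
        tokens = count_tokens(doc)
        if total + tokens > token_budget:
            break
        sampled.append(doc)
        total += tokens
    return sampled, total


def upsample_by_weighted_sampling(docs, token_budget, count_tokens, seed=0):
    """Alternative reading: sample documents with replacement, which gives
    the same expected duplication without a fixed repetition order."""
    rng = random.Random(seed)
    sampled, total = [], 0
    while total < token_budget:
        doc = rng.choice(docs)
        sampled.append(doc)
        total += count_tokens(doc)
    return sampled, total


if __name__ == "__main__":
    corpus = ["doc one text", "another document", "third unique doc"]
    naive_count = lambda d: len(d.split())  # stand-in for a real tokenizer
    dup, n = upsample_by_duplication(corpus, token_budget=20, count_tokens=naive_count)
    print(f"duplicated corpus: {len(dup)} docs, {n} tokens")
```

Either way, the upsampled corpus contains no new information beyond the deduplicated one, which is why any quality gains would have to come from the low-quality filtering rather than the duplication itself.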
I'm interested in seeing this scaled up to larger models (30B+ parameters), and the dataset expanded with more high-quality data (scientific papers, more books, more code, etc.).