科技回声 (Tech Echo)

A tech news platform built with Next.js, serving global tech news and discussion.

SlimPajama: A 627B token cleaned and deduplicated version of RedPajama

60 points · by andyk · almost 2 years ago

3 comments

ftxbro · almost 2 years ago
They better not have removed the good stuff, like the full texts of the subreddit dedicated to counting to a million, the logs of so many hashed numbers from various cryptos, and the tables of datamined stats from like every console game.
wskish · almost 2 years ago
Do they mention anywhere the definition of "low quality" data, or the proportion of removed data that was low quality versus duplicate?

They mention: "When upsampled, we expect SlimPajama to perform equal to or better than RedPajama-1T when training at trillion token scale." But I guess "upsampling" in this case is just explicit duplication of the training data, so the only potential gains would come from the removal of the low-quality data?
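The "explicit duplication" reading of upsampling can be sketched in a few lines. This is a minimal illustration of the commenter's interpretation, not the SlimPajama authors' actual procedure; the function name, the toy corpus, and using `len()` as a stand-in token counter are all assumptions for the example.

```python
# Hedged sketch: "upsampling" a deduplicated corpus by explicit repetition,
# per the comment's interpretation. Names and numbers are illustrative only,
# not taken from the SlimPajama release.

def upsample(docs, target_tokens, count_tokens=len):
    """Repeat documents round-robin until the token budget is reached."""
    out, total, i = [], 0, 0
    while total < target_tokens:
        doc = docs[i % len(docs)]  # cycle through the deduplicated corpus
        out.append(doc)
        total += count_tokens(doc)
        i += 1
    return out

# Toy "documents"; character count stands in for a real tokenizer.
corpus = ["aaaa", "bb", "cccccc"]
up = upsample(corpus, target_tokens=30)
```

Under this reading, the upsampled set contains no text that was absent from the deduplicated one, which is why the comment concludes that any quality gain must come from the removal of low-quality data rather than from the duplication itself.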
Tostino · almost 2 years ago
I'm interested in seeing this scaled up to larger models (30B+ parameters), and the dataset expanded with more high-quality data (scientific papers, more books, more code, etc.).