The Pile: An 800GB dataset of diverse text for language modeling (2020)

184 points by charlysl almost 2 years ago

8 comments

sillysaurusx almost 2 years ago
Author here. And by author I mean I created books3 (the books component of The Pile) while everyone else did the hard work of actually writing the paper, ha. Stella and Leo Gao in particular did so much wonderful work on the paper, though it couldn't have happened without everyone's contributions.

As far as I know, this was the first academic contribution to ML from a Discord collaboration. Back then Discord was barely used for ML at all, though nowadays of course the largest Discord in the world is Midjourney's.

There were a bunch of interesting stories from those days. We almost didn't release at all (or at least the books component) for fear of copyright backlash. Turns out no one cared, and then suddenly today the world cares a great deal.

As a side note, I'll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: https://twitter.com/theshawwn/status/1641804013791215619?s=61&t=jQbmCk1JqL7depzFWJNuPA. They DMCA'ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It's also one of the most ethical outcomes; since ~no one trains on data that they own, they shouldn't own the resulting model.

One last thing. The Pile would've been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They've hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. https://the-eye.eu/
Roark66 almost 2 years ago
Great stuff. I skimmed the article searching for a table showing a breakdown of content by language, but I haven't found one.

I hope there is a lot of text in languages other than English, because in my language (Polish), for example, current SOTA models are very deficient. I have wondered why that is, considering companies like (not at all) OpenAI claim to train on large datasets that include my language of interest. It turns out (and I learned this just yesterday) that they used LLM-translated English content as training data for other languages. They used Azure Translator, which is itself a transformer model, to generate content for GPT-3.5, for example. Also, I bet there is a lot of poorly machine-translated content in their supposedly "original" data.

The result? You can have ChatGPT write you an email of any kind in English and copy/paste/send it immediately. Try doing that in Polish... It will make sense, but it will use the wrong tone (too familiar in a business setting), bad words (words that exist, but that no real person would use), and sentence structure that just plainly feels weird. I suspect this is even worse in many other languages.
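(In the absence of such a table, the language mix of a corpus can be estimated by sampling documents and running a language detector over them. Below is a minimal sketch in that spirit; the `monology/pile-uncopyrighted` mirror on the Hugging Face Hub, its `text` field, and the `langdetect` package are assumptions, not anything from the paper, and a 1,000-document sample only gives a rough signal.)

```python
# Rough estimate of the language mix in a text corpus, via a streamed sample.
from collections import Counter

from datasets import load_dataset  # pip install datasets
from langdetect import detect      # pip install langdetect

# Assumption: "monology/pile-uncopyrighted" is a public mirror of The Pile
# exposing one document per row under a "text" field.
ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

SAMPLE = 1000
counts = Counter()
for i, doc in enumerate(ds):
    if i >= SAMPLE:
        break
    try:
        lang = detect(doc["text"][:2000])  # detect on a prefix for speed
    except Exception:                      # empty/undetectable docs raise
        lang = "unknown"
    counts[lang] += 1

for lang, n in counts.most_common(10):
    print(f"{lang}: {100 * n / SAMPLE:.1f}%")
```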
dang almost 2 years ago
Related:

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=36272365 - June 2023 (5 comments)

The Pile: An 800GB Dataset of Diverse Text for Language Modeling - https://news.ycombinator.com/item?id=25607809 - Jan 2021 (60 comments)
cschmidt almost 2 years ago
If you're looking at The Pile, you might also consider the RedPajama dataset. A new cleaned and deduplicated version, SlimPajama, was released recently: https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
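(On the "deduplicated" part: SlimPajama's pipeline used fuzzy MinHash-LSH deduplication, but the core idea is easy to see in an exact-match form. The sketch below is illustrative only, not the Cerebras code; it drops documents whose normalized text hashes identically.)

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Yield each document once, keyed by a hash of its normalized text."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["Hello  World", "hello world", "something else"]
print(list(exact_dedup(corpus)))  # -> ['Hello  World', 'something else']
```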
Der_Einzige almost 2 years ago
I came so close to getting my dataset DebateSum (https://huggingface.co/datasets/Hellisotherpeople/DebateSum) into The Pile, but they decided at the last minute not to add it: https://github.com/EleutherAI/the-pile/issues/56

I'm still a tiny bit salty about that, but The Pile is a wonderful dataset regardless.
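(Since DebateSum is public on the Hugging Face Hub under the ID linked above, it can be pulled with the `datasets` library. A minimal sketch; the split name and column layout are whatever the Hub configuration provides, so the code inspects them rather than assuming field names.)

```python
from datasets import load_dataset  # pip install datasets

# Dataset ID taken from the link in the comment above.
ds = load_dataset("Hellisotherpeople/DebateSum", split="train")

print(ds)      # row count and column names
print(ds[0])   # first example, to see the actual fields
```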
charlysl almost 2 years ago
OP here. I learned about this while reading the "Data" lecture of Stanford's LLM course [1]. Very interesting how it assesses the datasets used for GPT-2 and GPT-3, etc., and how The Pile addresses their issues. A very interesting course!

[1] https://stanford-cs324.github.io/winter2022/lectures/data/
robertheadley almost 2 years ago
As long as LLMs and generative AI use copyrighted works for training, they are going to be the enemy of creative people.
ryoshiro almost 2 years ago
Side topic: in the leaked OpenAI GPT-training details, there is speculation that OpenAI trained on a Libgen dataset. Is there a link to that dataset, and if so, how big is it?