科技回声 (Tech Echo)

A tech news platform built with Next.js, providing global tech news and discussion content.


The Pile: An 800GB Dataset of Diverse Text for Language Modeling

223 points, by leogao, over 4 years ago

13 comments

sillysaurusx, over 4 years ago

I'm sad they dropped Literotica from the dataset. It's available in the old "preliminary components" archive, if anyone wants to train a Literotica-Enron-Email GPT: https://the-eye.eu/public/AI/pile_preliminary_components/

I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)

By the way, consider contributing to The Eye: https://the-eye.eu/

Without them, I'm not sure any of us would have been able to host the datasets we gathered, or organize torrent seeds, or field DMCA complaints, etc. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.
legatus, over 4 years ago

I think it's worth noting that EleutherAI is a grassroots collection of researchers, which distinguishes it from academia/industry labs.

As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).

I would encourage everyone interested to join their Discord server (https://discord.gg/BK2v3EJ) -- they're extremely friendly and I think it's a project worth contributing to.
forgingahead, over 4 years ago

This is a great effort, and it's important to have datasets like this available to democratise ML learning and work.

One small comment: it would be great for this (and other) datasets to provide a quick "sample data" file, preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful for people browsing to get a quick understanding of the actual content and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.
w1nk, over 4 years ago

Will this dataset be able to produce a model that's actually better than GPT-3? One of the things mentioned in the paper is that the GPT-3 dataset had some buggy filtering applied to it (https://arxiv.org/pdf/2005.14165.pdf, page 9), which minimally impacted some benchmarks; there's a whole section on how they tried to deal with it. The gist, though, is that a cleaner training run, even on slightly more data, may (should?) actually produce something a bit better.

Does anyone know if OpenAI has retrained/updated GPT-3 yet?
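For context, the filtering bug discussed in the GPT-3 paper concerned removing training documents that overlap with benchmark data via n-gram matching. A minimal sketch of that idea (13-gram overlap with naive whitespace tokenization; my own toy illustration, not OpenAI's actual pipeline):

```python
def ngrams(text, n=13):
    """Lowercased whitespace tokens, collected as a set of overlapping n-grams."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(doc, benchmark_ngrams, n=13):
    """Flag a training document that shares any n-gram with the benchmark set."""
    return not ngrams(doc, n).isdisjoint(benchmark_ngrams)

# Hypothetical benchmark sentence and two candidate training documents.
benchmark = ("the quick brown fox jumps over the lazy dog "
             "while the cat sleeps soundly nearby")
bench_ngrams = ngrams(benchmark)

clean_doc = ("completely unrelated text about training language models "
             "on diverse corpora of text data")
dirty_doc = ("someone pasted this: the quick brown fox jumps over the lazy dog "
             "while the cat sleeps soundly nearby, verbatim")

print(is_contaminated(clean_doc, bench_ngrams))  # False
print(is_contaminated(dirty_doc, bench_ngrams))  # True
```

A real pipeline would normalize punctuation and stream over shards rather than hold sets in memory, but the flag-and-remove logic is essentially this.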
thereticent, over 4 years ago

Love the name. It's what we called our shared FTP-served collection of MP3s in college 15-19 years ago. Yours is a MUCH more impressive amount of information!
leogao, over 4 years ago

Twitter thread: https://twitter.com/nabla_theta/status/1345130412532645888
eutectic, over 4 years ago

I'm kind of surprised there's 800GB of text in the world.
bratao, over 4 years ago

I'm super excited by this dataset. The EleutherAI team is stellar, and many great things are coming soon from them!
pizza, over 4 years ago

Related: is it possible for me to download/try out a pretrained GPT-Neo?
floatingatoll, over 4 years ago

To clarify, they mean diverse as in "unconnected datasets".
durnygbur, over 4 years ago

Can someone explain to me how this won't stir the wrath of the copyright echelons, especially the American ones? Or once I get a PhD, can I torrent movies again?
unixhero, over 4 years ago

How can I start working with this dataset? Are there any Python GPT libraries?
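As a starting point: The Pile is distributed as zstandard-compressed JSONL shards, where each line is a JSON object with a `text` field and a `meta.pile_set_name` label. Below is a stdlib-only sketch of iterating records in that shape, using an inline sample (with hypothetical contents) instead of a real shard; for actual `.jsonl.zst` files you would wrap the file in a streaming reader from the third-party `zstandard` package:

```python
import io
import json
from collections import Counter

def iter_pile_records(stream):
    """Yield one parsed JSON record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Inline sample mimicking The Pile's record shape (contents are made up).
sample = io.StringIO(
    '{"text": "def f(x): return x", "meta": {"pile_set_name": "Github"}}\n'
    '{"text": "Dear all, ...", "meta": {"pile_set_name": "Enron Emails"}}\n'
    '{"text": "We prove that ...", "meta": {"pile_set_name": "ArXiv"}}\n'
)

# Tally documents per component subset.
counts = Counter(r["meta"]["pile_set_name"] for r in iter_pile_records(sample))
print(counts)
```

As for GPT libraries: Hugging Face `transformers` and EleutherAI's own training repositories are common starting points (my suggestion, not something stated in the thread).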
mrfusion, over 4 years ago

Why not throw Wikipedia into this collection?