TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

223 points by leogao over 4 years ago

13 comments

sillysaurusx over 4 years ago

I'm sad they dropped Literotica from the dataset. It's available in the old "preliminary components" archive, if anyone wants to train a Literotica-Enron-Email GPT: https://the-eye.eu/public/AI/pile_preliminary_components/

I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)

By the way, consider contributing to The Eye: https://the-eye.eu/

Without them, I'm not sure any of us would have been able to host the datasets we gathered — or organized torrent seeds, or fielded DMCA complaints, etc. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.
legatus over 4 years ago

I think it's worth noting that EleutherAI is a grassroots collection of researchers, which distinguishes it from academia/industry labs.

As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).

I would encourage everyone interested to join their Discord server (https://discord.gg/BK2v3EJ) -- they're extremely friendly and I think it's a project worth contributing to.
forgingahead over 4 years ago

This is a great effort, and it's important to have datasets like this available to democratise ML learning and work.

One small comment: it would be great for this (and other) datasets to provide a quick "sample data" file -- preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful for people browsing to get a quick understanding of the actual content and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.
w1nk over 4 years ago

Will this dataset be able to produce a model that's actually better than GPT-3? One of the things mentioned in the GPT-3 paper is that its dataset had some buggy filtering applied to it (https://arxiv.org/pdf/2005.14165.pdf, page 9), which minimally impacted some benchmarks; there's a whole section on how they tried to deal with it. The gist, though, is that a cleaner training run, even on slightly more data, may (should?) actually produce something a bit better.

Does anyone know if OpenAI has retrained/updated GPT-3 yet?
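The filtering being discussed is train/test contamination checking: flagging test documents that share long word n-grams with the training set. A toy sketch of the idea (the `ngrams`/`overlaps` helpers are illustrative only; the GPT-3 paper's actual procedure used 13-gram overlap plus additional heuristics, and this is not their code):

```python
def ngrams(text, n=3):
    """Set of word n-grams in a document, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def overlaps(train_doc, test_doc, n=3, threshold=1):
    """Flag contamination if the documents share >= threshold n-grams."""
    shared = ngrams(train_doc, n) & ngrams(test_doc, n)
    return len(shared) >= threshold
```

The bug class referenced in the paper is getting a check like this subtly wrong at scale, so some benchmark text leaks into training anyway.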
thereticent over 4 years ago
Love the name. It&#x27;s what we called our shared FTP-served collection of mp3s during college 15-19 years ago. Yours is a MUCH more impressive amount of information!
leogao over 4 years ago

Twitter thread: https://twitter.com/nabla_theta/status/1345130412532645888
eutectic over 4 years ago

I'm kind of surprised there's 800GB of text in the world.
bratao over 4 years ago

I'm super excited by this dataset. The EleutherAI team is stellar, and many great things are coming soon from them!
pizza over 4 years ago

Related: is it possible for me to download/try out a pretrained GPT-Neo?
floatingatoll over 4 years ago
To clarify, they mean diverse as in “unconnected datasets”.
durnygbur over 4 years ago

Can someone explain to me how this won't stir the wrath of the copyright echelons, especially the American ones? Or, once I get a PhD, can I torrent movies again?
unixhero over 4 years ago
How can I start working on this dataset? Are there any Python GPT libraries?
mrfusion over 4 years ago
Why not throw Wikipedia into this collection?