I'm sad they dropped Literotica from the dataset. It's available in the old "preliminary components" archive, if anyone wants to train a Literotica-Enron-Email GPT:<p><a href="https://the-eye.eu/public/AI/pile_preliminary_components/" rel="nofollow">https://the-eye.eu/public/AI/pile_preliminary_components/</a><p>I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)<p>By the way, consider contributing to The Eye: <a href="https://the-eye.eu/" rel="nofollow">https://the-eye.eu/</a><p>Without them, I'm not sure any of us would have been able to host the datasets we gathered, organize torrent seeds, field DMCA complaints, and so on. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.
I think it's worth noting that EleutherAI is a grassroots collection of researchers, which distinguishes it from academia/industry labs.<p>As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).<p>I would encourage everyone interested to join their Discord server (<a href="https://discord.gg/BK2v3EJ" rel="nofollow">https://discord.gg/BK2v3EJ</a>) -- they're extremely friendly, and I think it's a project worth contributing to.
This is a great effort, and it's important to have datasets like this available to democratise learning and working with ML.<p>One small comment: it would be great for this (and other) datasets to include a quick "sample data" file - preferably one that doesn't need to be downloaded to be viewed. Even a screenshot of some of the data would be useful, so people browsing can get a quick understanding of the actual content and how it is formatted. Downloading gigabytes of data just to have a look isn't practical.
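In the meantime, you don't actually need to download a full shard to take a peek: you can stream the start of one compressed file and decode a handful of records. Here's a minimal sketch in Python, assuming a shard path on the mirror (adjust it to whatever file you actually point at) and the `requests` and `zstandard` packages:<p><pre><code>import io
import json

import requests
import zstandard as zstd

# Example shard path on the mirror -- substitute the file you actually want.
URL = "https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"

with requests.get(URL, stream=True) as resp:
    resp.raise_for_status()
    # Decompress the HTTP body on the fly instead of saving the whole file.
    # If zstd complains about the window size, pass max_window_size=2**31
    # to ZstdDecompressor.
    reader = zstd.ZstdDecompressor().stream_reader(resp.raw)
    for i, line in enumerate(io.TextIOWrapper(reader, encoding="utf-8")):
        record = json.loads(line)  # each line is JSON with "text" and "meta"
        print(record["meta"], "|", record["text"][:200].replace("\n", " "))
        if i >= 4:  # a handful of records is plenty for a preview
            break</code></pre>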
Will this dataset be able to produce a model that's actually better than GPT-3? One of the things mentioned in the GPT-3 paper (<a href="https://arxiv.org/pdf/2005.14165.pdf" rel="nofollow">https://arxiv.org/pdf/2005.14165.pdf</a>, page 9) is that their dataset had some buggy filtering applied to it, which minimally impacted some benchmarks; there's a whole section on how they tried to account for it. The gist, though, is that a cleaner training run, even on slightly more data, may (should?) actually produce something a bit better.<p>Does anyone know if OpenAI has retrained/updated GPT-3 yet?
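For anyone curious, the filtering in question is essentially n-gram decontamination: drop training documents that share long n-grams (the paper uses 13-grams) with the benchmark/test sets. Here's a rough illustrative sketch of the idea, not OpenAI's actual pipeline -- `train_docs` and `benchmark_docs` below are hypothetical iterables of strings:<p><pre><code>def ngrams(text, n=13):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs, benchmark_docs, n=13):
    # Collect every long n-gram that appears in any benchmark document.
    test_ngrams = set()
    for doc in benchmark_docs:
        test_ngrams |= ngrams(doc, n)
    # Keep only training documents that share none of those n-grams.
    return [d for d in train_docs if not (ngrams(d, n) & test_ngrams)]</code></pre>The bug they mention meant this overlap removal was only partial, which is why the paper spends a section estimating how much the remaining contamination affected the benchmark numbers.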
Love the name. It's what we called our shared FTP-served collection of mp3s during college 15-19 years ago. Yours is a MUCH more impressive amount of information!