I'm sad they dropped Literotica from the dataset. It's available in the old "preliminary components" archive, if anyone wants to train a Literotica-Enron-Email GPT: https://the-eye.eu/public/AI/pile_preliminary_components/

I think The Pile is probably one of the most important AI projects of the last year or so, at least for lone-wolf researchers like me. Gathering training data at scale can be excruciatingly difficult. (Perhaps someone will do something similar for GAN training one day: a large dataset for image modeling would help a lot.)

By the way, consider contributing to The Eye: https://the-eye.eu/

Without them, I'm not sure any of us could have hosted the datasets we gathered, kept torrent seeds organized, fielded DMCA complaints, and so on. So I feel The Eye deserves an explicit shoutout for being such an asset to the AI community, along with TFRC and Eleuther.
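
For anyone grabbing that Literotica component from the archive above: most Pile components ship as zstd-compressed JSONL, so reading one is roughly the sketch below. The filename and the "text" key are assumptions on my part; check the archive listing and a sample line before relying on either.

    # Rough sketch for streaming a Pile component, assuming a
    # zstd-compressed JSONL file with one document per line.
    import io
    import json
    import zstandard  # pip install zstandard

    path = "Literotica.jsonl.zst"  # assumed filename; verify in the archive
    with open(path, "rb") as fh:
        # Decompress incrementally so the whole file never sits in memory.
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            # Pile-style records usually keep the document under "text";
            # adjust if this component uses a different schema.
            print(record["text"][:80])
            break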