Author here. And by author I mean I created books3 (the books component of The Pile) while everyone else did the hard work of actually writing the paper, ha. Stella and Leo Gao in particular did so much wonderful work on the paper, though it couldn’t have happened without everyone’s contributions.<p>As far as I know, this was the first academic ML contribution to come out of a Discord collaboration. Back then Discord was barely used for ML at all, though nowadays of course the largest Discord server in the world is Midjourney.<p>There were a bunch of interesting stories from those days. We almost didn’t release at all (or at least not the books component) for fear of copyright backlash. Turns out no one cared at the time, and now suddenly the world cares a great deal.<p>As a side note, I’ll be participating in a legal action against Meta for the purpose of making ML models uncopyrightable: <a href="https://twitter.com/theshawwn/status/1641804013791215619?s=61&t=jQbmCk1JqL7depzFWJNuPA" rel="nofollow noreferrer">https://twitter.com/theshawwn/status/1641804013791215619?s=6...</a>. They DMCA’ed one of my repos distributing LLaMA, so we fought back and challenged the idea that weights can be copyrighted at all. This seems like the best outcome for hackers and individual researchers, for a few reasons. It’s also one of the most ethical outcomes: since ~no one trains on data that they own, they shouldn’t own the resulting model.<p>One last thing. The Pile would’ve been far less relevant without the wonderful assistance of The Eye, a group of people who archive all kinds of things. They’ve hosted the datasets for years now. And although it seems strange to say that dataset hosting could make or break The Pile, back then there was nobody else willing to host us. <a href="https://the-eye.eu/" rel="nofollow noreferrer">https://the-eye.eu/</a>