I don't understand why it's even a question whether Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from the LLaMA paper [Touvron et al., 2023]:<p>> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.<p>Following that reference:<p>> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).<p>(Presser, 2020) refers to <a href="https://twitter.com/theshawwn/status/1320282149329784833" rel="nofollow">https://twitter.com/theshawwn/status/1320282149329784833</a>. (Which, amusingly, refers to this DMCA policy: <a href="https://the-eye.eu/dmca.mp4" rel="nofollow">https://the-eye.eu/dmca.mp4</a>.)<p>Furthermore, they state they trained on GitHub, web pages, and arXiv, all of which contain copyrighted content.<p>Surely the real question is: is it legal to train, use, and/or distribute an AI model (or its weights, or its outputs) that was trained on copyrighted material? That it was trained on copyrighted material is certain.<p>[Touvron et al., 2023] <a href="https://arxiv.org/pdf/2302.13971" rel="nofollow">https://arxiv.org/pdf/2302.13971</a><p>[Gao et al., 2020] <a href="https://arxiv.org/pdf/2101.00027" rel="nofollow">https://arxiv.org/pdf/2101.00027</a>