I don't think copyright has much meaning left in it anymore. Any work can be "extracted" into its elements, ideas and style, and recombined in a million ways.<p>You could generate a billion images with SD and train the next model on them, making sure none of them look close to copyrighted works. Being AI-generated, they carry no copyright of their own (at least under current US Copyright Office guidance). You can still mix in real data as long as it is in the public domain.<p>Repeat this for enough generations and the original copyrighted dataset ends up further and further removed from the model. A model trained this way can't reproduce a copyrighted work from memory, because it never saw one during training.<p>But more importantly, this process strictly separates ideas from expression and trains only on ideas, without the copyrighted expression. If authors still complain, it means they want to own ideas and styles.<p>You can also use copyrighted works to train a classifier that ranks the quality of training examples, and apply it to filter your synthetic data down to the higher-quality samples.<p>You can even train an RLHF model to say when two works are "close enough" to constitute infringement, and double down on safety by making sure you don't generate or use risky works.<p>That's why I say I don't think copyright has much meaning left in it anymore. Knowledge wants to be free; it travels, shape-shifts and evolves. It does not belong to any one of us unless we keep it to ourselves.
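<p>To make the filtering step concrete, here is a rough sketch, not a real pipeline. It assumes every image has already been turned into an embedding vector by some model, and the names (`filter_synthetic`, `quality_fn`, `max_similarity`, `min_quality`) and both thresholds are made up for illustration. Cosine similarity here is only a stand-in for a real "close enough" detector.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_synthetic(
    candidates: List[Vector],
    protected: List[Vector],
    quality_fn: Callable[[Vector], float],
    max_similarity: float = 0.9,  # hypothetical "too close" threshold
    min_quality: float = 0.5,     # hypothetical quality cutoff
) -> List[Vector]:
    """Keep synthetic samples that are far from protected works and good enough."""
    kept = []
    for emb in candidates:
        # Similarity to the nearest protected work; 0.0 if there are none.
        closest = max(
            (cosine_similarity(emb, p) for p in protected), default=0.0
        )
        if closest >= max_similarity:
            continue  # looks too much like a protected work, drop it
        if quality_fn(emb) < min_quality:
            continue  # the quality classifier scores it too low
        kept.append(emb)
    return kept


# Toy data: 2-D "embeddings" and a fake quality score (second coordinate).
protected_works = [[1.0, 0.0]]
synthetic = [[1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
clean = filter_synthetic(synthetic, protected_works, lambda e: e[1])
print(clean)
```

<p>In the toy run the first sample is dropped because it sits almost on top of the protected vector, and the other two survive both checks. A real setup would swap in a learned similarity model and a trained quality classifier for the two stand-ins.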