There was one scenario like this I came up with when brainstorming copyright issues. Articles I read in the past said you could keep a backup copy of a copyrighted work as long as it stayed with you. Sending the copy one way and the original another might be a violation, since it could count as distributing copies. But we might be able to make one digital copy of a physical work for our own use. How to use that?<p>Problem statement for poor person’s pre-training: Where do we get lots of data if they aren’t providing it as data sets and we can’t share them? What about multimodal models? And how do we do it with less copyright risk?<p>The idea was to buy lots of encyclopedia sets, school curriculums, picture books… used media at low prices (esp. bin sales) full of information. Digitize them with book scanners. Keep the digital copies and throw away the physical copies. Now you have a huge training set of legal data, acquired dirt cheap, with preprocessing simple enough for cheap labor.<p>From there, use the copies in places like Japan, where it’s legal to use them for AI training so long as one has legal access. This also incentivizes the owner to pre-train the model themselves so there is no distribution of the original works. I also envisioned people partly training models with their own data, then handing the weights off to another company that adds its data, and so on. Daisy-chain the process so that only the models, not the copyrighted works, are distributed. My copyright amendment added preprocessing and copying for just this purpose, to avoid these ridiculous hacks.<p>To be clear, I wouldn’t do this without consulting several lawyers on what is actually allowed. I’d rather not be destroying books and filling landfills. It is <i>pure speculation</i> I made due to how overly strong copyright is in my country. It assumes we can use copyrighted works (a) for personal use and (b) with a physical-to-digital conversion. 
If not, I thought those would be easier rights to fight for, too.<p>However, legal hacks like scanning used works to train on them in other countries might be all we have if legal systems don’t adapt to what society is currently doing with copyrighted works. I mean adapt in a way that’s fair to all sides rather than benefiting only one. I’m up for compromise. The pro-copyright side usually hasn’t been, though.