It feels weird that the Authors Guild is so intent on making public the names of the former OpenAI employees who created the datasets. That seems entirely unnecessary and irrelevant to the case. If what they did was part of their sanctioned work for OpenAI, they are not the ones responsible for how it was used.
"These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and <i>deleted</i> due to non-use in 2022." -<p>Deleting the dataset because of non-use sounds completely implausible. It says the dataset is 67B tokens, which is less than 1TB of data. Why would you bother to delete it given it would cost more or less nothing to keep?
Assuming they train newer models using output from older versions, isn't the data they collected from "books1" and "books2" still encoded in the weights of their current models?

This also raises the question: does OpenAI really honor its privacy controls and refrain from training on user information when the user opts out? Most companies seem to be operating in "ask for forgiveness rather than permission" mode as they scramble to stay competitive in the AI race. [0]

[0] https://news.ycombinator.com/item?id=40127106
> The unsealed letter from OpenAI's lawyers, which is labeled "highly confidential - attorneys' eyes only," says that the use of "books1" and "books2" for model training was discontinued in late 2021 and that the datasets were deleted in mid-2022 because of their nonuse. The letter goes on to say that none of the other data used to train GPT-3 has been deleted and offers attorneys for the Authors Guild access to those other datasets.<p>That sounds like ass-covering, and maybe destruction of evidence. If the data is destroyed, won't it be much harder to prove which books they violated copyright on and to figure out the damages owed?
There was one scenario like this I came up with while brainstorming copyright issues. Articles I've read said you could keep a backup copy of a copyrighted work as long as it stayed with you. Letting the copy and the original go their separate ways might be a violation, since that could count as distributing copies. But we might be able to make one digital copy of a physical work for our own use. How could that be used?

Problem statement for a poor person's pre-training: where do you get lots of data when nobody is providing it as datasets and we can't share them? What about data for multimodal models? And with less copyright risk?

The idea was to buy lots of encyclopedia sets, school curricula, picture books… used media at low prices (especially bin sales) full of information. Digitize them with book scanners. Keep the digital copies and throw away the physical copies. Now you have a huge training set of legally acquired data, obtained dirt cheap, with preprocessing enabling cheaper labor.

From there, use the copies in places like Japan, where it's legal to use them for AI training as long as one has legal access. This also incentivizes the owner to pre-train the model themselves, so there is no distribution of the original works. I also envisioned companies partly training models on their own data, handing the weights off to another company, which adds its data, and so on: daisy-chain the process with only the models, not the copyrighted works, being distributed. My copyright amendment added preprocessing and copying for exactly this purpose, to avoid these ridiculous hacks.

To be clear, I wouldn't do this without consulting several lawyers about what was clearly legal. I'd rather not be destroying books and filling landfills. It is *pure speculation* driven by how overly strong copyright is in my country. It assumes we can use copyrighted works (a) for personal use and (b) with a physical-to-digital conversion. If not, I figured those would be easier rights to fight for, too.

However, legal hacks like scanning used works to be trained on in other countries might be all we have if legal systems don't adapt to what society is currently doing with copyrighted works. I mean adapt in a way that's fair to all sides rather than benefiting only one. I'm up for compromise. The pro-copyright side usually hasn't been, though.
So when is the Authors Guild going to realize that you can pretrain a base model on public-domain material, and once it's distributed, anyone can fine-tune it on whatever books they want to get their own version that writes in the style of X? On commodity GPUs, no less. In five years, when the current-gen GPUs are cheap and there are hundreds of one-click fine-tuning apps, this will seem like an absurd waste of time.
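To illustrate how little ceremony this takes already, here's a minimal sketch of such a LoRA fine-tune with off-the-shelf Hugging Face tooling. The model name, training file, and LoRA target modules are placeholder assumptions, not a tested recipe:

    # Sketch: fine-tune a public-domain-pretrained base model on book text
    # using LoRA adapters, small enough to run on a single consumer GPU.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "some-public-domain-base-model"  # placeholder name
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-style models often lack one
    model = AutoModelForCausalLM.from_pretrained(base)

    # Only a few million adapter weights get trained; the base stays frozen.
    # target_modules vary by architecture -- these are a common choice.
    model = get_peft_model(model, LoraConfig(
        r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM"))

    books = load_dataset("text", data_files={"train": "style_of_x.txt"})
    tokenized = books["train"].map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="style-of-x", num_train_epochs=1,
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, fp16=True),
        train_dataset=tokenized,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()

Note that nothing in that snippet touches anyone's pretraining data; the copyrighted text enters only at the fine-tuning step, on the user's own machine.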
Isn't copyright about reproducing and/or distributing?

Does training a model count as reproduction or distribution?

Or can copyright say something about how I consume my books? Can copyright prohibit me from lighting a fire with, or wiping my behind with, say, Harry Potter and the Goblet of Fire?
Isn't all or most of their training data copyrighted anyway?

We just have to say it's fair use, because it is useful to everyone. Maybe just require them to open their model.