> “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.”<p>So basically the "we stole it from a thief, therefore we didn't steal" excuse?
For anyone familiar with the legal landscape: setting aside scenarios where AI products reproduce their training material, why isn't this covered under fair use?<p>Don't humans basically do the same thing when attempting to create new music — they draw on a lifetime of inspiration derived from the works of others?
Legal concerns aside, aren't YouTube captions primarily AI-generated in the first place? I know some authors meticulously hand-craft their captions, but that can't be the case for the vast majority of videos.<p>Therefore, isn't training AI on this basically poisoning your own model? The caption quality is good, but there are mistakes in pretty much every video I watch with captions.
I imagine data laundering (?) is common.<p>E.g.: Nike needs to produce a large amount of clothes. They hire an overseas company that commits to the order. They set strict rules -- no child labor, certain quality controls, etc. This company then subcontracts any way possible, delivers the order, gets paid, and dissolves. Messy, but Nike's hands are clean.<p>With AI, it's the same thing but with videos and other forms of data.<p>Hence why the question "did you train on YouTube?" is so difficult for a certain CTO to answer.
Whether covered under fair use or not, the laws around copyright today did not anticipate this use case. Congress should pass laws that clarify how data is and isn’t allowed to be used in training AI models, how creators should or shouldn’t be compensated, etc - rather than speculating whether this usage technically does or doesn’t comply with the law as-is.
Pretty sure web scraping has been upheld as legal since Microsoft lost its case against companies scraping LinkedIn. And generative content is also legal; that even extends to reposting a copyrighted video as long as there is commentary over it. That's an extreme case of fair use, but it shows how broadly fair use can apply to copyrighted video.<p>Personally, I've been using the Fabric AI tool, since it can summarize YouTube videos, so I don't have to watch an hour-plus video or read a very long article/journal. It just gives me a summary and the top talking points, or even breaks down the technical points.<p><a href="https://github.com/danielmiessler/fabric">https://github.com/danielmiessler/fabric</a>
I don’t see anything wrong with these companies using YouTube content to train AI in a sense. I think the creators of the videos should be fairly compensated and their permission should be sought, but I don’t think of Google/Alphabet in that way. Sorry but even if Google runs the YouTube platform, I just don’t think they ethically or morally have an exclusive right to the content the world creates, just because they have various monopolies that are immune to competition due to anti-competitive moves and the power of network effects. As far as I am concerned they are a utility service that needs to be heavily regulated.
So there is a clear effort to build enclosures around various corpuses of material that could or would be useful for training AI. Thing is, people read books, they watch videos, they listen to music, they see and produce art, and so on. How is training data distinct from "human training data"?<p>One could say the quantity. We're currently dealing with statistical learning models that require a huge quantity of training data. This is temporary. At some point you will be able to train an ML system with less, because humans can be trained with less. What then?
This is about the Pile dataset; of course, we don't know if it has been used to train the commercial models we use or just for the research papers mentioned in the article.
How many of the YouTube channels depend on fair use themselves?<p>For example, Jacksepticeye is listed as having their videos used. Looking at the channel, it seems like a lot of it is recordings of them playing video games.<p>Are the companies that produced these games being compensated?
It's a chicken-and-egg problem (?). ChatGPT couldn't have made LLMs so valuable without stealing. It just forces a freemium model, if Google accepts a settlement or after-the-fact payments without pressing any criminal charges.
They're not the only ones. I've seen other AI firms, like Adept, doing the same thing. I smell a class-action suit from all of the video makers who weren't protected by YouTube.