> “YouTube’s terms cover direct use of its platform, which is distinct from use of The Pile dataset. On the point about potential violations of YouTube’s terms of service, we’d have to refer you to The Pile authors.”<p>So basically the "we stole it from a thief, therefore we didn't steal" excuse?
For anyone familiar with the legal landscape: setting aside scenarios where AI products reproduce their training material, why isn't this covered under fair use?<p>Don't humans basically do the same thing when attempting to create new music — they draw on a lifetime of inspiration derived from the works of others?
Legal concerns aside, aren't YouTube captions primarily AI-generated in the first place? I know some authors meticulously hand-craft their captions, but that can't be the case for the vast majority of videos.<p>Therefore, isn't training AI on this basically poisoning your own model? The caption quality is good, but there are mistakes in pretty much every video I watch with captions.
I imagine data laundering (?) is common.<p>E.g.: Nike needs to produce a large amount of clothes. They hire an overseas company that commits to the order. They set strict rules -- no child labor, certain quality controls, etc. This company then subcontracts any way possible, delivers the order, gets paid, and dissolves. Messy, but Nike's hands are clean.<p>With AI, it's the same thing but with videos and other forms of data.<p>Hence why the question "did you train on YouTube?" is so difficult for a certain CTO to answer.
Whether covered under fair use or not, the laws around copyright today did not anticipate this use case. Congress should pass laws that clarify how data is and isn’t allowed to be used in training AI models, how creators should or shouldn’t be compensated, etc - rather than speculating whether this usage technically does or doesn’t comply with the law as-is.
Pretty sure web scraping has been upheld as legal since Microsoft lost its case against companies scraping LinkedIn. And generative content is also legal; that even extends to reposting a copyrighted video as long as there is commentary over it. That's an extreme case of fair use, but it shows how broadly fair use can apply to copyrighted video.<p>Personally, I've been using the Fabric AI tool, since it can summarize YouTube videos, so I don't have to watch an hour-plus video or read a very long article/journal. It just gives me a summary and the top talking points, or even breaks down the technical points.<p><a href="https://github.com/danielmiessler/fabric">https://github.com/danielmiessler/fabric</a>
I don’t see anything wrong with these companies using YouTube content to train AI in a sense. I think the creators of the videos should be fairly compensated and their permission should be sought, but I don’t think of Google/Alphabet in that way. Sorry but even if Google runs the YouTube platform, I just don’t think they ethically or morally have an exclusive right to the content the world creates, just because they have various monopolies that are immune to competition due to anti-competitive moves and the power of network effects. As far as I am concerned they are a utility service that needs to be heavily regulated.
So there is a clear effort to build enclosures around various corpuses of material that could or would be useful for training AI. Thing is, people read books, they watch videos, they listen to music, they see and produce art, and so on. How is training data distinct from "human training data"?<p>One could say the quantity. We're currently dealing with statistical learning models that require a huge quantity of training data. This is temporary. At some point you will be able to train an ML system with less, because humans can be trained with less. What then?
This is about the Pile dataset; of course, we don't know if it has been used to train the commercial models we use or just for the research papers mentioned in the article.
How many of the YouTube channels depend on fair use themselves?<p>For example, Jacksepticeye is listed as having their videos used. Looking at the channel, it seems like a lot of it is recordings of them playing video games.<p>Are the companies that produced these games being compensated?
It's a chicken-and-egg problem (?). ChatGPT couldn't have made LLMs so valuable without stealing. It just forces a freemium model, if Google accepts a settlement or after-the-fact payments without pressing any criminal charges.
They're not the only ones. I've seen other AI firms, like Adept, doing the same thing. I smell a class-action suit from all of the video makers who weren't protected by YouTube.