i think plausibly being able to use youtube videos as training data was the major reason for google to buy youtube in the first place. i'd be very surprised if youtube's terms of service actually prohibit <i>google</i> from doing this<p>also, while a lot of tim's thoughts are excellent, i strongly disagree with this part<p>> <i>When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid</i><p>reading books, watching videos, and attending trainings or other performances are not rights reserved to copyright holders, and indeed the history of copyright law carefully and specifically excludes such activities from requiring copyright licenses. consequently copyright holders do not in fact get paid for them. the first sale doctrine means that, in the usa (where the nyt has filed its lawsuit), not only can copyright holders not charge people for reading books and watching videos, they can't even stop them from reselling used books and videos<p>this is fundamental to the freedom of thought and inquiry that underlies liberal civilization; it's not a minor detail
I’m having a hard time understanding Tim’s point here. He seems to have retrieval and generation confused, or combined, or something.<p><i>Of course</i> in the retrieval (R in RAG) you can do attribution of the source material: you’re just doing a search and bringing up a literal excerpt of something from your database. You know exactly where you got it.<p>But then for generation (G) you hand that excerpt to the LLM, and the only reason the model can “understand” it is the vastly larger corpus of text you trained the model on, whose origin is, due to the very nature of LLMs, smeared out of existence. <i>That</i> is the controversial aspect that (AFAIK) has no technical solution at present.<p>He seems to imply that because attributing the R part is easy, attributing the G part should be too, but those are completely independent problems. Not to mention that you don’t <i>have</i> to do a retrieval step to generate things in the first place; the LLM alone can do that.<p>The part about doing a retrieval on the output to see if it’s similar to something else is at least technically possible, but he handwaves past the problem of what on earth you’re supposed to do if you find something. YouTube doesn’t do a great job (e.g., people getting copyright strikes from their own performances of public domain works) and it at least has an unambiguous set of things to search for.
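The R-vs-G distinction above can be made concrete with a minimal sketch. Everything here (the corpus, the scoring, the `generate` placeholder) is hypothetical: retrieval is just search, so every excerpt carries its source id, while the generation step stands in for an LLM call whose phrasing comes from training data with no recoverable provenance.

```python
def retrieve(query, corpus):
    """Retrieval (the R in RAG): a plain keyword search over a database.
    The source of every hit is known, so attribution is trivial."""
    hits = []
    for doc_id, text in corpus.items():
        score = sum(1 for word in query.lower().split() if word in text.lower())
        if score > 0:
            hits.append((score, doc_id, text))
    hits.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in hits]

def generate(prompt, excerpts):
    """Generation (the G): a stand-in for the LLM call. Whatever wording the
    model adds comes from its training corpus, whose origin is not recoverable."""
    context = " ".join(text for _, text in excerpts)
    return f"[model output conditioned on: {context!r} | origin of phrasing: unknown]"

# Hypothetical corpus with per-document ids.
corpus = {
    "nyt-2023-04-01": "OpenAI faces a lawsuit over training data.",
    "blog-post-17": "RAG combines search with text generation.",
}
excerpts = retrieve("lawsuit training data", corpus)
print([doc_id for doc_id, _ in excerpts])  # sources are known at the R step
print(generate("summarize", excerpts))     # but not inside the G step
```

The point of the sketch is that the attribution metadata stops at the boundary between the two functions: nothing `retrieve` returns helps you attribute what `generate` composes around it.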
> When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid<p>That's what copyright holders would like - which is why they try to restrict licences so much these days.<p>It isn't the case. Many individuals can read, watch and listen to a work for which only one payment has been made.<p>But really all the AI bots are doing is reading, watching and listening to the content <i>and remembering it</i> in pretty much the same way as a person does.<p>Is an artist who has listened to a bazillion blues numbers and then constructs a new song based on what they have heard and ingested in violation of copyright?<p>If not, then neither is AI. It's just a probability matrix, not a replica.<p>If the AI people have paid the correct fee to listen to, watch or read the material and then sell what they remember, that is no different from any trained professional. The contract has been fulfilled.<p>The problem we have is that copyright holders have been dining off past glories for too long, and have gained the power to embed that rent for way longer than is sensible. What AI does is make those copyrights rot faster, as it has a better memory than most people. That's good for society, because it forces copyright holders to produce more new stuff if they want to maintain their income.<p>A rebalancing away from rent seekers towards regular producers would be good for everybody.
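The "probability matrix, not a replica" framing above can be illustrated with a toy bigram model. This is a drastic simplification of an LLM (which can, with enough capacity, memorize passages), and the training lines are made up; it shows only that what such a model stores is transition probabilities, not the texts themselves.

```python
from collections import defaultdict

def train_bigrams(texts):
    """Count word-to-word transitions, then normalize to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

# Hypothetical "blues numbers" the model has listened to.
songs = [
    "woke up this morning feeling blue",
    "woke up this morning with the blues",
]
model = train_bigrams(songs)
# The model holds only P(next word | word); the two source lines have been
# merged into shared statistics rather than stored as copies.
print(model["this"])     # {'morning': 1.0}
print(model["morning"])  # {'feeling': 0.5, 'with': 0.5}
```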
I mean the way to fix it is to recognise that it's absolutely cool to train AI on publicly available knowledge. Like it's not a sin.<p>Maybe it comes from growing up with google hoovering up the internet, file sharing becoming commonplace, and image boards making sharing copyrighted photos as reaction images the done thing. But I already feel like I own the sum total of human knowledge. I don't recognise sony or disney or the US government as valid inheritors or controllers of information. It's mine. And if there's a set of tools that can chew on that data to make new or interesting or collated or curated data, then that just makes more data that I own. I own your art. I own your code. I own your stack overflow answers. I own every film and tv show and book from human history. And if that common ownership isn't the end goal then I don't know what is.<p>If fair use doesn't currently cover these cases, it should be extended to cover them.
This is the middle class fighting the middle class; the billionaires who own both types of organization, and the AI capital itself, will inevitably use it to increase productivity and capture all of the excess monetary gains while the middle class shrinks.
> <i>respect signals like subscription paywalls, the robots.txt file, the HTML “noindex” keyword, terms of service, and other means by which copyright holders signal their intentions.</i><p>And if they disrespect those signals & terms, and lie, what then?<p>> <i>the copyrighted content has been ingested, but it is detected during the output phase as part of an overall content management pipeline.</i><p>Yes, but <i>how</i> can it be detected reliably? Considering there is <i>much</i> to be gained in fooling us. (think parallel construction)
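One answer to the "how can it be detected" question above is exact n-gram overlap against a corpus of protected text, sketched below. The corpus, threshold, and names are all made up, and the sketch also shows the weakness the comment is gesturing at: a trivial paraphrase defeats exact matching, so reliable detection would need something much fuzzier.

```python
def ngrams(text, n):
    """All length-n word runs in a text, as a set of tuples."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flags_output(output, protected_texts, n=5):
    """Return the protected sources sharing any n-word run with the output."""
    out_grams = ngrams(output, n)
    return [src for src, text in protected_texts.items()
            if out_grams & ngrams(text, n)]

# Hypothetical protected corpus.
protected = {
    "nyt-article": "the quick brown fox jumps over the lazy dog every day",
}
# A verbatim 5-word run is caught...
print(flags_output("he said the quick brown fox jumps over it", protected))
# ...but a light paraphrase sails straight through.
print(flags_output("the speedy brown fox leaps over the lazy dog", protected))
```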
I sometimes worry that qualms about copyright & ethics will make us lose the machine learning arms race with China. If "The Unreasonable Effectiveness of Data" still holds true, then we are in big trouble.
this article hits many nuanced points well.. and it steers towards a future I can support as a WEIRD (Western, Educated, Industrialized, Rich, and Democratic) person.<p>too many detailed parts to respond to all of them in a brief reply.. I don't think these recommendations will apply to many parts of the non-West world..<p>here in the USA, this content makes a lot of sense to me
These[0][1] tweets showed up in my timeline recently. I don't know whether it's just anti-AI luddite propaganda, and I do know people who definitely don't fit the caricature below, but it seems to resonate with my sentiments and the overall gist of the matter:<p><pre><code> David Holz mentioned in today’s Midjourney office hours that they have more customers over 45 than under 18. He said you’re more likely to run into a 65-year-old woman than a teenager in MidJourney’s community. I think that’s a really interesting point about generative AI. It’s a bunch of old people who are telling you that it’s some mind-blowing invention. Young people are largely uninterested. They think it’s boomer art, not the guaranteed technology of the future. It’s only cool to olds.
The internet was never like that—old people haven’t traditionally driven the culture. Tiktok and YouTube are often associated with younger people. Popular music and movies too. AI seems to only appeal to people who are past their prime. It lacks the It factor for people under 40.
0: https://www.threads.net/@thebrianpenny/post/C8aTGjUycs-/
1: https://twitter.com/chiefluddite/status/1803704263148970255</code></pre>
> <i>Meanwhile, the AI model developers, who have taken in massive amounts of capital, need to find a business model that will repay all that investment. </i><p>...<p>> <i>The extreme case is these companies are no longer allowed to use copyrighted material in building these chatbots. And that means they have to start from scratch. They have to rebuild everything they’ve built. So this is something that not only imperils what they have today, it imperils what they want to build in the future.</i><p>...<p>> <i>"the only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data."</i><p>The simple and correct answer then is that they shouldn't exist. I don't care how much money they spent building a plagiarism machine, the expense and desire to build it doesn't excuse it being a plagiarism machine. Rich VCs don't get to ignore the rules just because they think they found a neat hack to make them billionaires. And after the shenanigans with the OpenAI board, the altruistic "for the betterment of humanity" argument has been revealed to be a thin sham.
> <i>When someone reads a book, watches a video, or attends a live training, the copyright holder gets paid</i><p>Bullshit. I have given many a book to a friend and they have passed it on. If this statement were true, then O(n) payments would have been made.