Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same. But I think there is a broader point to make here. YouTube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme also.<p>Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers, you forfeit your life.
The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.<p>What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.<p>Suckers. All of us.
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:<p>> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.<p>Following that reference:<p>> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).<p>(Presser, 2020) refers to <a href="https://twitter.com/theshawwn/status/1320282149329784833" rel="nofollow">https://twitter.com/theshawwn/status/1320282149329784833</a>. (Which funnily refers to this DMCA policy: <a href="https://the-eye.eu/dmca.mp4" rel="nofollow">https://the-eye.eu/dmca.mp4</a>)<p>Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.<p>Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.<p>[Touvron et al., 2023] <a href="https://arxiv.org/pdf/2302.13971" rel="nofollow">https://arxiv.org/pdf/2302.13971</a><p>[Gao et al., 2020] <a href="https://arxiv.org/pdf/2101.00027" rel="nofollow">https://arxiv.org/pdf/2101.00027</a>
I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.<p>In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.
Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.
We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.
It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict, and that restricting humanity's access to knowledge in such a way was holding us all back for the benefit of a tiny few.
Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).
So if I torrented and seeded, I would be doing it for my own entertainment, not commercially. I'd expect big copyright holders to come after me. If Meta does it - I guess they have better lawyers?<p>Could make for interesting case law.
Is there a concept in the legal system of first-come-first-served that could be used as precedent?<p>What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?<p>Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.<p>It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.
Considering prices for a single work, this must be multi-billion dollar compensation.<p>Take for example the $675k paid for 31 songs - roughly $21.8k per song. If we estimate a book at say 10MB, the 80TB would be about 8 million works. So I think reasonable compensation is something on the order of $174 billion. Not even 10 years of net income. Which I think is an entirely fair punishment.
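Redoing that back-of-the-envelope arithmetic explicitly - using the assumed inputs above (the $675k/31-song file-sharing verdict, a 10MB average book size, and the reported ~80TB of downloaded data) - gives a figure in that ballpark:

```python
# Back-of-the-envelope damages estimate. All inputs are assumptions
# from the comment above, not established legal figures.
verdict_total = 675_000            # USD, the famous 31-song file-sharing verdict
songs = 31
per_work = verdict_total / songs   # ~$21,774 per infringed work

downloaded_bytes = 80 * 10**12     # ~80 TB reportedly torrented
avg_book_bytes = 10 * 10**6        # assumed 10 MB per book
works = downloaded_bytes // avg_book_bytes  # => 8,000,000 works

total = per_work * works
print(f"${per_work:,.0f} per work x {works:,} works = ${total / 1e9:.0f} billion")
```

With those assumptions the per-work precedent scales to roughly $174 billion; change the average book size and the total scales linearly.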
My ISP will shut off my internet if it catches me torrenting copyrighted material, but if you're a massive corporation that steals TBs of data, it's barely a blip in the news.
This should be legal. Copyright law does more harm than good.<p>The only ethical problem here is that only Meta sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.
"Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in "stealth mode." Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition..."<p>They will be getting a lot of Frommer Legal letters...
The question is, if they could and would have paid for each book, would it be ok to train the LLM on them? I'm talking about prior books, I'm sure new books have language forbidding their use to train LLMs at the point of sale.
But legally, how does using a book to train an LLM differ from a teacher learning from a book and teaching its contents to their pupils? Obviously, the LLM can do so at scale, but is there a legal difference?
For some mysterious reason I can't see Zuckerberg in front of a judge facing 50 years' imprisonment. Can anyone?<p>I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.<p>And if we are going to be fair, please also let's not forget about the other usual suspects - or does anyone think they are falling behind?
Really curious what the judges are going to do here.<p>The horse has functionally bolted on this already.<p>I'm guessing a slap on the wrist, despite courts going after individuals pretty hard for a couple of torrented movies.
I wonder what happened to the related OpenAI training GPT3 on the books3 dataset story[1] from ~2 years ago?<p>[1]: <a href="https://www.wired.com/story/battle-over-books3/" rel="nofollow">https://www.wired.com/story/battle-over-books3/</a>
I'm more interested in piracy not being highly prosecuted than I am in Meta getting punished for this. I'm not trying to spend 20 years in jail for pirating a TV show.
Support EFF if you think that the copyright laws should be changed and also applied equally to all: <a href="https://www.eff.org/issues/innovation" rel="nofollow">https://www.eff.org/issues/innovation</a>
> By September 2023, Bashlykov had seemingly dropped the emojis, consulting the legal team directly and emphasizing in an email that "using torrents would entail ‘seeding’ the files—i.e., sharing the content outside, this could be legally not OK."<p>I'm pretty sure you can theoretically download torrents without seeding, although this is frowned upon. If they really seeded (with full bandwidth?) that's indeed pretty brazen.<p>It is sort of strange that Meta is being singled out here though, and sort of sad considering they at least release the model weights. What's the signal? Do illegal shit to be competitive, but make sure there is no evidence?
Great, can we get the full Kim Dotcom treatment for Zuckerberg now?<p>I'm also ok with abolishing copyright altogether if he's too untouchable.
So according to some AI, the minimum statutory damages per infringed work are ~$750 in the US. 80TB of books, each let's say 10MB on average, would be 8 million works. So Meta should pay 6 billion USD for their copyright infringement?
Best way to "punish" Meta is to slash the Gordian knot and abolish copyright. Level the playing field, incrementally, for everyone else who isn't a trillion-dollar corporation.<p>The alternative is a futile legalistic attack against a monopoly entity too powerful to be meaningfully punished. That won't accomplish anything useful. It would, rather, help cement this status quo, where copyright infringement is selectively legal or illegal, for different entities at the same time; and companies like Meta thrive arbitraging that difference. You can't defeat Meta—but you <i>can</i> help dig them a moat.
"Say they hood robin, ain't that a b*, take from the poor and give to the rich."<p>- Ice Cube.<p>Meta will face no consequences. Say you're a small publisher and you'd like a bit of compensation. If you dare sue, Meta can just blacklist your books on its platforms. Even if they don't, you probably don't have the money to sue one of the biggest companies on earth.<p>I think copyrights should be limited to 25 years after first publication. This would fix plenty of issues and give the AIs of the world plenty to learn from.<p>Who am I kidding, Meta will take what they will. For that author making 20k a year, be honored to be of use to Meta.
Maybe you should go after the worst offender (OpenAI) first before going after Meta, since the latter at least gave their model weights and architecture away for free for everyone.<p>We all know why OpenAI isn't getting investigated.
The usual copyright cartel is up in arms, crying theft. But here’s the truth: intellectual property is a state-enforced monopoly, not real property.<p>Property is based on scarcity - if you take my car, I no longer have a car. But if you copy my book, I still have my book. No loss, no theft, just an outdated legal fiction designed to stifle innovation and enrich rent-seeking middlemen. And no, loss of potential sales doesn't count - it's like being able to claim a lottery ticket has real value.<p>Copyright was never about protecting creators—it’s about locking down ideas, preventing competition, and extracting endless fees. Shakespeare borrowed, tech companies iterate, and science thrives on free exchange. The idea that knowledge should be locked away indefinitely is absurd.<p>Meta’s mistake wasn’t using the data - it was pretending copyright still matters. AI is exposing the system for what it is: obsolete. The future belongs to those who create without asking permission.
This reminds me of Peter Sunde's "Kopimashin"<p><a href="https://www.engadget.com/2015-12-21-peter-sunde-kopimashin.html" rel="nofollow">https://www.engadget.com/2015-12-21-peter-sunde-kopimashin.h...</a><p>It's obviously absurd to enforce copyright as bytes are copied around instead of as it is used. Training an LLM is a different thing than re-hosting and giving away copies to other people.<p>If you don't want people to transform your works - keep them private. You don't own ideas.
Really strange how much torrenting is demonized by all of these companies and ISPs when individuals want to use it, yet when a company like Meta uses it there is so little scrutiny.
We have at least 4 types of ill-defined concepts of property in the 21st century, largely due to our laziness, intellectual inertia and lack of motivation to make forward-thinking definitions for the coming age of AI and ubiquitous access to all information and all communication.<p>1) the concept of copyright is as old as the word suggests (copies are the least of our worries going forward - it should be possible to define processes for exploitation of ideas in a fair way)<p>2) we allow humans to learn from other people's ideas and transform them into commercial products, and the same should happen for AIs in the future<p>3) we have an ill-defined concept of "personally identifying information" which gives people ownership of information that others have created via their own means - there should be better ways to ensure a level of privacy (but not absolute privacy) without overly broad, nonsensical definitions of what is personally protected information<p>4) We allow social media and other telecommunications media to arbitrarily censor people's speech without recourse. This turns people's speech into the property of the social media companies and imposes absolute power over it. This makes zero sense and is abusive towards the public at large. We need legal protections of speech in all media, not just state-owned media.
Who would have known that BitTorrent, shadow libraries, and seeders would help train the best AI models out there. That adds a whole new meaning to "seeding".
How about a consequentialist argument? In some fields, AI has already surpassed physicians in diagnosing illnesses. If breaking copyright laws allows AI to access and learn from a broader range of data, it could lead to earlier and more accurate diagnoses, saving lives. In this case, the ethical imperative to preserve human life outweighs the rigid enforcement of copyright laws.
If you're an author with a book likely to have been hoovered up, I wonder what you'd get from the FB models if you asked "complete this in the style of [author] in [book]: [quite a long excerpt]".<p>If you get a direct quote back, then surely you're good with your claim.
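One crude way to score such a probe, sketched below purely as an illustration (the `source` and `completion` strings here are made up; in practice `completion` would be the model's actual output): treat the output as a "direct quote" if it reproduces a long verbatim run of words from the book.

```python
def longest_verbatim_run(completion: str, original: str, n: int = 8) -> int:
    """Length (in words) of the longest word-for-word run from `original`
    that appears verbatim in `completion`. Runs shorter than `n` words
    are ignored as likely coincidence."""
    orig_words = original.lower().split()
    # Pad with spaces so matches respect word boundaries.
    comp_padded = " " + " ".join(completion.lower().split()) + " "
    best = 0
    for i in range(len(orig_words)):
        for j in range(i + n, len(orig_words) + 1):
            window = " ".join(orig_words[i:j])
            if f" {window} " in comp_padded:
                best = max(best, j - i)
            else:
                break  # longer windows starting at i can't match either
    return best

# Hypothetical example: a completion that parrots 12 words of the source.
source = ("It was the best of times it was the worst of times "
          "it was the age of wisdom")
completion = ("the novel opens: it was the best of times "
              "it was the worst of times indeed")
run = longest_verbatim_run(completion, source)
print(run)  # 12 -> long enough to suggest verbatim memorization
```

Whether a 12-word verbatim run is legally meaningful is a separate question, but as a first filter it separates paraphrase from regurgitation.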
That they’d focus on file sharing over transformation or outputs is exactly the risk I warned the companies about in my AI report. Most datasets, like RefinedWeb and The Pile, also require sharing copyrighted works between people who are not licensed to do that. Many works also prohibit commercial use or have patents on them.<p>They need to make datasets which don’t have this problem, or have entities in Singapore train the foundation models within their rules. The latter has a TDM exemption that would let AIs use much of the Internet, maybe GPL code, licensed/purchased works they digitize, etc. Very flexible.
I think everyone can see that whatever the alleged social contract was (imo not in accordance with the Constitution, after absurdities like interpreting "limited times" the way mathematicians might define something of some order of infinity), it is not functioning the way it was intended, and we see who benefits and who loses.<p><i>mass dynamic editing for vitriol and profanity occurred while writing this comment in order to remain within site rules</i>
Wow, I'm actually a bit shocked that senior levels of management at Meta were fine with torrenting pirated books. WTaF.<p>Meta does a lot of stuff I disagree with, but they're usually not just straight breaking the law.
LLMs are worse than search for figuring out what value a specific asset provides to the LLM. At least with search, your work or page is not lost and still gets a click/user interaction, and maybe gives you a chance to monetize the interaction. LLMs just don’t have any such option. Gemini adds links, but the links it adds are completely editorialized by the LLM and need not reflect the original at all. So how does anyone ask for compensation, even if they sue?
Copyright law needs major reform. We need to figure out a way to let authors monetize their work while not making complying with the law so burdensome. We've created a system where people who (understandably) ignore the law benefit at the expense of people trying to do the right thing.
Sounds just like how Facebook got started, harvesting photos without permission. From the Wikipedia article, the Facebook precursor was known as Facemash. On Zuckerberg, "He hacked into the online intranets of Harvard Houses to obtain photos, developing algorithms and codes along the way. He referred to his hacking as "child's play.""<p>If I were younger, I would be livid.
>>"vastly smaller acts of data piracy—just .008 percent of the amount of copyrighted works Meta pirated—have resulted in Judges referring the conduct to the US Attorneys’ office for criminal investigation.".....While Meta may be confident in its legal strategy despite the new torrenting wrinkle...<p>Zuckerberg has paid the vig several times [0,1,2], which is evidently the best legal strategy under this administration. OFC, considering there are already multiple payments, there is no assurance the vig payments won't substantially increase as the Capo sees more opportunity for profit.<p>[0] <a href="https://en.wikipedia.org/wiki/Vigorish" rel="nofollow">https://en.wikipedia.org/wiki/Vigorish</a><p>[1] <a href="https://www.politico.com/news/2025/01/29/meta-settles-trump-facebook-ban-lawsuit-007810" rel="nofollow">https://www.politico.com/news/2025/01/29/meta-settles-trump-...</a><p>[2] <a href="https://www.bbc.com/news/articles/c8j9e1x9z2xo" rel="nofollow">https://www.bbc.com/news/articles/c8j9e1x9z2xo</a>
You know the weird thing is - I've never used Meta AI. I've never thought of using it. The only Meta product I use is WhatsApp, and I've not seen/heard any of my friends using Meta AI in FB, IG, or WhatsApp. I really don't understand what their ROI here is...
I thought about it for a full day, and I have one idea for how to handle copyrighted data training.
It would need to be open/regulated, and training past the point of double descent would need to be disallowed, to make sure that the model is not memorizing the data.
Damn! One of my old books can be found in the Anna's Archive search. The book has been out of print for years. I pity the Meta users who get results based on something that I wrote.
(Check Anna's for 'Keith P. Graham', and the first book listed is mine.)
At OpenAI, we have seen some employees express their concerns publicly about the moral grounds on which the company was acting. We never heard anything like that from anyone at Meta, though there were some jokes of course. I guess everything is fair in AI and corporations.
One of the largest businesses of the Internet to date has been piracy. Individual informal piracy has been the smallest component of this. By far the largest has been corporate mass-scale piracy, and LLMs are probably the largest heist to date. They've literally downloaded the sum total of all human thought and knowledge, compressed it into queryable lossy compression models (which is what LLMs are), and are selling it back to us.<p>Meta, with its "open weights" models, is one of the least guilty parties, since at least they've made the resulting blobs of mass piracy available to us. Same with Mistral, Deepseek, etc.<p>ClosedAI, Google, and others have all probably done this and more and refuse to make even the model available.<p>I think the way to deal with this is very simple:<p>If you have trained your model on works to which you do not have rights or permission, the resulting model is not copyrightable and cannot be sold. It must either be kept for research purposes only or released free of charge and in the public domain. All these models that have been trained on pirated works should become public domain.<p>Of course now that we have full capture of the US Federal Government I'm sure any suggestion like that would be neutralized with one bribe to Trump.
I’d think people can get together to put this on a public space strictly for training purposes and have the consortium of some sort get paid per use.<p>But we live in this stupid society where you have to move mountains to change things an inch.
As an individual, I would be liable to pay ~$1000 in damages if I downloaded a movie in Germany or Poland and the publisher got to me.<p>I'm going to assume that since it's a corporation, the laws no longer apply.
The only bad thing about this is that small-time players who do it are treated poorly (Aaron Swartz). IP de facto not existing for AI companies is a feature, not a bug.<p>The fact that most of the world embraced hardcore copyright-troll Luddism when the economic means of production of their (badly paying creative) jobs were democratized implies that most people do not believe in any "egalitarianism", and especially not the left-wing form many profess to believe in. Certainly not "information wants to be free" or any of the other idealist shit that I or Aaron Swartz believed in. What Meta did was software communism - full stop. They literally released their models to the public! I support all of this 10000%. The only issue is that they're not open enough (fully open source the dataset).<p>So, unironically, good! Thank you, please pirate more! Please destroy the US IP system while you're at it. Copyright abolitionism is good, and thank you Zuckerberg!
We're grateful to Meta for helping seed and backup our torrents. The more copies the better. Thank you Meta, for helping preserve humanity's legacy! :)
Copy-right is not learn/train-right. That said, Meta fills its mouth with "open source" while releasing models that are neither SOTA nor usable for commercial purposes.
Wouldn't it be a real shame if the entirety of US constitution, laws, and legal precedent went out the window these days, and the only thing left unscathed was the rotten mess that is copyright law? Just saying, this might be the moment to burn it to the ground. Not that it makes up for any of the other stuff going on, but why waste a perfectly good crisis?
We're starting to find out that Meta ruined LibGen for the rest of us, who used it like a library.
Just like how Google screwed over libraries by sending interns to the Stanford library to check out books that they scanned into Google Books.
Not to increase shared knowledge or preserve human artifacts,
but to put them all in a museum and, to paraphrase Joni Mitchell,
charge the people a dollar and a half just to see 'em.
Yes, it smells bad, but Facebook did the right thing (at least for Facebook).<p>After OpenAI trained their models on the famed <i>books2</i> dataset, and seeing the technological implications of ChatGPT, there was a good chance they would let them get away with it.<p>Would the USA really surrender its AI technological advantage for trivial matters like copyright? They would make some royalty arrangement and get it over with.
Remember people getting sued insane amounts of money per-song they torrented. If we applied that precedent to Meta, Meta would need to declare bankruptcy. <a href="https://www.cbsnews.com/news/file-sharing-mom-fined-19-million/" rel="nofollow">https://www.cbsnews.com/news/file-sharing-mom-fined-19-milli...</a>
Yeah well, OpenAI compressed the whole internet into proprietary weights and is now providing access via paid subscription while the original internet gets deleted from our culture.
Come on publishers! This is your chance! Now you can really show how you'll treat all copyright infringements equally and not only go after easy targets. Show us how you'll spend all that money in a lawsuit against Meta!
So they're gonna go through every book that was stolen and apply the appropriate penalty, right? Each copyrighted work carries a minimum statutory penalty of $750 under US copyright law. That will be applied fairly in order to ensure that the rights holder is made whole by the infringer, right?<p>It's so funny to see the law blatantly ignored by the overlords. Like, there isn't even a pretext anymore. They just steal what they want and budget for the fines and campaign donations to make the consequences go away.
One of the many reasons why Zuck’s been sucking up to Trump. He’s in desperate need of some Get-Out-Of-Jail-Free cards.<p>Same for all the other sleazy tech bros.
Boo hoo.<p>We are trying to advance civilization here. To accumulate and make available all human knowledge to date. And you stand there with your hand out to stop this? You are a villain. There is no sympathy for you.
I deleted my Facebook account about 10 years ago. Downloaded my data, deleted. Not deactivated.<p>Nothing in my life ever made me want to go back, until I got back into playing hockey a few months ago - and all the hockey leagues use Facebook to communicate.<p>I made a new account and had to literally upload a picture of my face to pass verification... and then a few days later I was banned and couldn't use my account. I assume they searched previous data and compared my face to find my "deleted" (lol) account and matched me. I've assumed they'll only let me log in if I use my original, deleted-10-years-ago account.<p>Fuck meta. Fuck zuck.
And they're going to get away with it simply because if you or I openly did this the DMCA fines would be for a million trillion dollars. Since Meta shareholders can't stomach a million trillion dollars in fines, their lawyers will wave their magic wands and poof! No laws were broken!
Nothing is gonna happen.
Just a slap on the hand.
And all of us in the intellectual working class - writers, journalists, programmers - will be proletarianized by LLMs that have been:<p>a) Financed via inflation/the "Cantillon effect" due to ZIRP/stimulus that absolutely flooded the market with funny money in the hands of the sharks.
b) Trained upon copyrighted work without compensation.
c) Trained upon open source without even asking politely for authorization.<p>The Robber Barons from the last century can't even get close to our modern Feudal Tech Lords.<p>Unless you're one of the few who have amassed multi-generational wealth in an exit in the last 20 years, you're completely fucked.