Meta torrented & seeded 81.7 TB dataset containing copyrighted data

1270 points by gameshot911, 3 months ago

96 comments

gizmo, 3 months ago
Based on the encyclopedic knowledge LLMs have of written works, I assume all parties did the same. But I think there is a broader point to make here. YouTube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme also.

Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

peterbonney, 3 months ago
The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.

What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.

Suckers. All of us.

JW_00000, 3 months ago
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:

> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.

Following that reference:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)

Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.

Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.

[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971

[Gao et al., 2020] https://arxiv.org/pdf/2101.00027

peterclary, 3 months ago
I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.

In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.

mik1998, 3 months ago
Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.

yoavm, 3 months ago
We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.

fimdomeio, 3 months ago
It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict and that restricting humanity's access to knowledge in such a way was holding us all back for the benefit of a tiny few.

gameshot911, 3 months ago
Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).

bmsleight_, 3 months ago
So if I torrented and seeded, I would be doing it for my own entertainment, not commercially. I would expect big copyright holders to come after me. If Meta does it, I guess they have better lawyers?

Could make interesting case law.

nyoomboom, 3 months ago
Remembering Aaron Swartz in this moment

zackmorris, 3 months ago
Is there a concept in the legal system of first-come-first-served that could be used as precedent?

What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?

Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.

It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.

Ekaros, 3 months ago
Considering the prices awarded for a single work, this must be a multi-billion dollar compensation.

Take for example the 675k paid for 31 songs. So roughly 20k a song. If we estimate a book to be, say, 10MB, that would be 8 million works. So I think reasonable compensation is something along the lines of 163 billion. Not even 10 years of net income. Which I think is an entirely fair punishment.
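
As a rough check of the arithmetic in the comment above, here is a minimal Python sketch. The inputs are the commenter's own figures; treating 81.7 TB as decimal terabytes and 10 MB as the average book size are assumptions, not facts about the dataset, and the variable names are made up for illustration.

```python
# Back-of-envelope check of the damages estimate quoted above.
# All inputs are the commenter's figures or labeled assumptions.
AWARD_USD = 675_000            # award cited in the comment, for 31 songs
SONGS = 31
DATASET_BYTES = 81.7e12        # 81.7 TB, assuming decimal terabytes
AVG_BOOK_BYTES = 10e6          # 10 MB per book: the commenter's guess

per_song_exact = AWARD_USD / SONGS          # unrounded per-work figure
per_work_rounded = 20_000                   # the comment's rounded "20k a song"
works = DATASET_BYTES / AVG_BOOK_BYTES      # estimated number of works

total_rounded = works * per_work_rounded
total_exact = works * per_song_exact

print(f"per song: ${per_song_exact:,.0f}")
print(f"works: {works / 1e6:.2f} million")
print(f"total at $20k/work: ${total_rounded / 1e9:.1f} billion")
print(f"total at exact rate: ${total_exact / 1e9:.1f} billion")
```

With the rounded $20k figure the total comes to roughly 163 billion, which matches the comment; using the unrounded per-song figure pushes it closer to 178 billion. Either way the result is dominated by the guessed average book size.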

panki27, 3 months ago
They could have at the very least seeded some more, to give something back to the, uh, community.

RobotToaster, 3 months ago
Before I decide my opinion on this, I need to know their ratio.

wnevets, 3 months ago
My ISP will shut off my internet if it catches me torrenting copyrighted material, but if you're a massive corporation that steals TBs of data it's barely a blip in the news.

lrvick, 3 months ago
This should be legal. Copyright law does more harm than good.

The only ethical problem here is that only Meta-sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.

belter, 3 months ago
"Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in "stealth mode." Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition..."

They will be getting a lot of Frommer Legal letters...

bigmattystyles, 3 months ago
The question is, if they could and would have paid for each book, would it be OK to train the LLM on them? I'm talking about prior books; I'm sure new books have language forbidding their use to train LLMs at the point of sale. But legally, how does using a book to train an LLM differ from a teacher learning from a book and teaching its contents to their pupils? Obviously, the LLM can do so at scale, but is there a legal difference?

liendolucas, 3 months ago
For some mysterious reason I can't see Zuckerberg in front of a judge facing 50 years' imprisonment. Can anyone?

I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.

And if we are going to be fair, please let's also not forget about the other usual suspects. Or does anyone think they are falling behind?

Havoc, 3 months ago
Really curious what the judges are going to do here.

Horse has functionally bolted on this already.

I'm guessing a slap on the wrist, despite courts going pretty hard after individuals for a couple of torrented movies.

woadwarrior01, 3 months ago
I wonder what happened to the related OpenAI training GPT3 on the books3 dataset story[1] from ~2 years ago?

[1]: https://www.wired.com/story/battle-over-books3/

ksynwa, 3 months ago
A good chance for federal prosecutors to "send a message" as they did with Aaron Swartz, but I don't see things going that way.

openplatypus, 3 months ago
Something tells me uncle Donald will exonerate his new favourite lapdog from any criminal or civil liability.

HPsquared, 3 months ago
If you owe the bank $1,000 it's your problem; if you owe the bank $1,000,000,000 it's the bank's problem.

65, 3 months ago
I'm more interested in piracy not being highly prosecuted than I am in Meta getting punished for this. I'm not trying to spend 20 years in jail for pirating a TV show.

fsflover, 3 months ago
Support EFF if you think that the copyright laws should be changed and also applied equally to all: https://www.eff.org/issues/innovation

sva_, 3 months ago
> By September 2023, Bashlykov had seemingly dropped the emojis, consulting the legal team directly and emphasizing in an email that "using torrents would entail 'seeding' the files, i.e., sharing the content outside, this could be legally not OK."

I'm pretty sure you can theoretically download torrents without seeding, although this is frowned upon. If they really seeded (with full bandwidth?) that's indeed pretty brazen.

It is sort of strange that Meta is being singled out here though, and sort of sad considering they at least release the model weights. What's the signal? Do illegal shit to be competitive, but make sure there is no evidence?

jokethrowaway, 3 months ago
Great, can we get the full Kim Dotcom treatment for Zuckerberg now?

I'm also OK with abolishing copyright altogether if he's too untouchable.

mnsu, 3 months ago
So according to some AI, the damages awarded per infringed work are ~$750 minimum in the US. 80TB of books, each let's say 10MB on average, would be 8 million works. So Meta should pay 6 billion USD for their copyright infringement?
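
The estimate above depends heavily on the assumed average book size. A small Python sketch, using the comment's $750 per-work figure and its rounded 80 TB, shows how the total moves as that assumption changes. The function name and the list of sizes are hypothetical, chosen only for illustration.

```python
def statutory_minimum_total(dataset_bytes: int, avg_book_bytes: int,
                            per_work_usd: int = 750) -> int:
    """Minimum total damages if every book-sized chunk counts as one work.

    per_work_usd defaults to the $750 figure quoted in the comment.
    """
    works = dataset_bytes // avg_book_bytes   # estimated number of works
    return works * per_work_usd

DATASET_BYTES = 80 * 10**12                   # 80 TB, decimal units (assumption)

# Try several guesses for the average book size, in megabytes.
for avg_mb in (1, 2, 5, 10):
    total = statutory_minimum_total(DATASET_BYTES, avg_mb * 10**6)
    print(f"{avg_mb:>2} MB/book -> ${total / 1e9:.0f} billion")
```

At 10 MB per book this reproduces the comment's 6 billion; at 1 MB per book the same floor would be ten times higher.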

perihelions, 3 months ago
Best way to "punish" Meta is to slash the Gordian knot and abolish copyright. Level the playing field, incrementally, for everyone else who isn't a trillion-dollar corporation.

The alternative is a futile legalistic attack against a monopoly entity too powerful to be meaningfully punished. That won't accomplish anything useful. It would, rather, help cement this status quo, where copyright infringement is selectively legal or illegal, for different entities at the same time; and companies like Meta thrive arbitraging that difference. You can't defeat Meta, but you *can* help dig them a moat.

999900000999, 3 months ago
"Say they hood robin, ain't that a b*, take from the poor and give to the rich."

- Ice Cube.

Meta will face no consequences. Say you're a small publisher and you'd like a bit of compensation. If you dare sue, Meta can just blacklist your books on its platforms. Even if they don't, you probably don't have the money to sue one of the biggest companies on earth.

I think copyrights should be limited to 25 years after first publication. This would fix plenty of issues and give the AIs of the world plenty to learn from.

Who am I kidding, Meta will take what they will. For that author making 20k a year, be honored to be of use to Meta.

rvz, 3 months ago
Maybe you should go after the worst offender (OpenAI) first before going after Meta, since the latter already gave their model and the architecture away for free to everyone.

We will know why OpenAI isn't getting investigated.

postepowanieadm, 3 months ago
That's horrible! Magnet anyone?

kelseyfrog, 3 months ago
The usual copyright cartel is up in arms, crying theft. But here's the truth: intellectual property is a state-enforced monopoly, not real property.

Property is based on scarcity - if you take my car, I no longer have a car. But if you copy my book, I still have my book. No loss, no theft, just an outdated legal fiction designed to stifle innovation and enrich rent-seeking middlemen. And no, loss of potential sales doesn't count - it's like being able to claim a lottery ticket has real value.

Copyright was never about protecting creators; it's about locking down ideas, preventing competition, and extracting endless fees. Shakespeare borrowed, tech companies iterate, and science thrives on free exchange. The idea that knowledge should be locked away indefinitely is absurd.

Meta's mistake wasn't using the data - it was pretending copyright still matters. AI is exposing the system for what it is: obsolete. The future belongs to those who create without asking permission.

abigail95, 3 months ago
This reminds me of Peter Sunde's "Kopimashin"

https://www.engadget.com/2015-12-21-peter-sunde-kopimashin.html

It's obviously absurd to enforce copyright as bytes are copied around instead of as it is used. Training an LLM is a different thing than re-hosting and giving away copies to other people.

If you don't want people to transform your works - keep them private. You don't own ideas.

caterwhal, 3 months ago
Really strange how much torrenting is demonized by all of these companies and ISPs when individuals want to use it but when a company like Meta uses it there is so little scrutiny.

seydor, 3 months ago
We have at least 4 types of ill-defined concepts of property in the 21st century, largely due to our laziness, intellectual inertia and lack of motivation to make forward-thinking definitions for the coming age of AI and ubiquitous access to all information and all communication.

1) The concept of copyright is as old as the word suggests (copies are the least of our worries going forward - it should be possible to define processes for exploitation of ideas in a fair way).

2) We allow humans to learn from other people's ideas and transform them into commercial products, and the same should happen for AIs in the future.

3) We have an ill-defined concept of "personally identifying information" which gives people ownership of information that others have created via their own means - there should be better ways to ensure a level of privacy (but not absolute privacy) without overly broad, nonsensical definitions of what is personally protected information.

4) We allow social media and other telecommunications media to arbitrarily censor people's speech without recourse. This turns people's speech into property of the social media companies and imposes absolute power on it. This makes zero sense and is abusive towards the public at large. We need legal protections of speech in all media, not just state-owned media.

ofou, 3 months ago
Who would have known that BitTorrent, shadow libraries, and seeders would help train the best AI models out there? That adds a whole new meaning to "seed".

gorbachev, 3 months ago
Previous: https://news.ycombinator.com/item?id=42673628

z7, 3 months ago
How about a consequentialist argument? In some fields, AI has already surpassed physicians in diagnosing illnesses. If breaking copyright laws allows AI to access and learn from a broader range of data, it could lead to earlier and more accurate diagnoses, saving lives. In this case, the ethical imperative to preserve human life outweighs the rigid enforcement of copyright laws.

nprateem, 3 months ago
If you're an author with a book likely to have been hoovered up, I wonder what you'd get from the FB models if you asked "complete this in the style of [author] in [book]: [quite a long excerpt]"

If you get a direct quote then you're good with your claim, surely.

aucisson_masque, 3 months ago
You wouldn't download a car.

nickpsecurity, 3 months ago
That they'd focus on file sharing over transformation or outputs is exactly the risk I warned the companies about in my AI report. Most datasets, like RefinedWeb and The Pile, also require sharing copyrighted works between people who are not licensed to do that. Many works also prohibit commercial use or have patents on them.

They need to make datasets which don't have this problem or have entities in Singapore train the foundation models within their rules. The latter has a TDM exemption that would let AIs use much of the Internet, maybe GPL code, licensed/purchased works they digitize, etc. Very flexible.

nullfield, 3 months ago
I think everyone can see that whatever

(imo not in accordance with the Constitution, after absurdities like deciding "limited time" the way mathematicians might define something of some order of infinity)

the alleged social contract was is not functional the way it was intended, and we see who benefits and who loses.

*mass dynamic editing for vitriol and profanity occurred while writing this comment in order to remain within site rules*

stevage, 3 months ago
Wow, I'm actually a bit shocked that senior levels of management at Meta were fine with torrenting pirated books. WTaF.

Meta does a lot of stuff I disagree with, but they're usually not just straight breaking the law.

passwordoops, 3 months ago
Eye for an eye. Meta loses rights to 81.7 TB of IP. Transcribed into a text file.

scotty79, 3 months ago
Seeding it was probably the most societally useful thing Meta ever did.

yalogin, 3 months ago
LLMs are worse than search for figuring out what value a specific asset provides to the LLM. At least with search your work or page is not lost and still gets a click/user interaction, which may give you a chance to monetize the interaction. However, LLMs just don't have any such option. Gemini adds links, but the links they add are completely editorialized by the LLM and need not reflect the original at all. So how does anyone ask for compensation even if they sue?

pjfin123, 3 months ago
Copyright law needs major reform. We need to figure out a way to let authors monetize their work while not making complying with the law so burdensome. We've created a system where people who (understandably) ignore the law benefit at the expense of people trying to do the right thing.

ngneer, 3 months ago
Sounds just like how Facebook got started, harvesting photos without permission. From the Wikipedia article, the Facebook precursor was known as Facemash. On Zuckerberg, "He hacked into the online intranets of Harvard Houses to obtain photos, developing algorithms and codes along the way. He referred to his hacking as "child's play.""

If I were younger, I would be livid.

toss1, 3 months ago
>> "vastly smaller acts of data piracy, just .008 percent of the amount of copyrighted works Meta pirated, have resulted in Judges referring the conduct to the US Attorneys' office for criminal investigation." ... While Meta may be confident in its legal strategy despite the new torrenting wrinkle...

Zuckerberg has paid the vig several times [0,1,2], which is evidently the best legal strategy under this administration. OFC, considering there are already multiple payments, there is no assurance the vig payments won't substantially increase as the Capo sees more opportunity for profit.

[0] https://en.wikipedia.org/wiki/Vigorish

[1] https://www.politico.com/news/2025/01/29/meta-settles-trump-facebook-ban-lawsuit-007810

[2] https://www.bbc.com/news/articles/c8j9e1x9z2xo

buyucu, 3 months ago
I love this. Large corpos should torrent more. Maybe we'll get better copyright law as a result.

thunder-blue-3, 3 months ago
You know the weird thing is - I've never used Meta AI. I've never thought of using it. The only FB product I use is WhatsApp, and I've not seen/heard of any of my friends using Meta AI for FB, IG, or WhatsApp. I really don't understand what their ROI here is...

asjir, 3 months ago
I thought about it for a full day, and I have one idea for how to handle copyrighted data training. It would need to be open / regulated and training till double descent would need to be disallowed, to make sure that the model is not memorizing the data.

kpgraham, 3 months ago
Damn! One of my old books can be found in the Anna's Archive search. The book has been out of print for years. I pity the Meta users who get results based on something that I wrote. (Check Anna's for 'Keith P. Graham', and the first book listed is mine.)

srameshc, 3 months ago
At OpenAI we have seen some employees publicly express their concern about the moral grounds on which the company was acting. We never heard about it from anyone at Meta, but there were some jokes of course. I guess everything is fair in AI and corporates.

api, 3 months ago
One of the largest businesses of the Internet to date has been piracy. Individual informal piracy has been the smallest component of this. By far the largest has been corporate mass-scale piracy, and LLMs are probably the largest heist to date. They've literally downloaded the sum total of all human thought and knowledge, compressed it into queryable lossy compression models (which is what LLMs are), and are selling it back to us.

Meta, with its "open weights" models, is one of the least guilty parties, since at least they've made the resulting blobs of mass piracy available to us. Same with Mistral, Deepseek, etc.

ClosedAI, Google, and others have all probably done this and more and refuse to make even the model available.

I think the way to deal with this is very simple:

If you have trained your model on works to which you do not have rights or permission, the resulting model is not copyrightable and cannot be sold. It must either be kept for research purposes only or released free of charge and in the public domain. All these models that have been trained on pirated works should become public domain.

Of course now that we have full capture of the US Federal Government I'm sure any suggestion like that would be neutralized with one bribe to Trump.

flojo, 3 months ago
Did they at least seed back?

lvl155, 3 months ago
I'd think people can get together to put this in a public space strictly for training purposes and have a consortium of some sort get paid per use.

But we live in this stupid society where you have to move mountains to change things an inch.

StefanBatory, 3 months ago
I as an individual would be liable to pay ~1000$ in damages if I'd downloaded a movie in Germany or Poland and the publisher got to me.

I'm going to assume that since it's a corporation, the laws no longer apply.

Der_Einzige, 3 months ago
The only bad thing about this is that small-time players who do it are treated poorly (Aaron Swartz). IP de facto not existing for AI companies is a feature, not a bug.

The fact that most of the world embraced hardcore copyright-troll Luddism when the means of economic production of their (badly paying, creative) jobs were democratized implies that most people do not believe in any "egalitarianism" and especially not the left-wing form many profess to believe in. Certainly not "information wants to be free" or any of the other idealist shit that I or Aaron Swartz believed in. What Meta did was software communism - full stop. They literally released their models to the public! I support all of this 10000%. The only issue is that they're not open enough (fully open source the dataset).

So, unironically, good! Thank you, please pirate more! Please destroy the US IP system while you're at it. Copyright abolitionism is good and thank you Zuckerberg!

pilimi_anna, 3 months ago
We're grateful to Meta for helping seed and backup our torrents. The more copies the better. Thank you Meta, for helping preserve humanity's legacy! :)

djyaz1200, 3 months ago
“Behind every great fortune lies a great crime” -Honoré de Balzac

antirez, 3 months ago
Copy-right is not learn/train-right. That said, Meta fills its mouth with talk of open source while releasing models that are neither SOTA nor usable for commercial purposes.

black_puppydog, 3 months ago
Wouldn't it be a real shame if the entirety of US constitution, laws, and legal precedent went out the window these days, and the only thing left unscathed was the rotten mess that is copyright law? Just saying, this might be the moment to burn it to the ground. Not that it makes up for any of the other stuff going on, but why waste a perfectly good crisis?

maxwell, 3 months ago
I'm sure they'll throw the book at them.

cratermoon, 3 months ago
We're starting to find out that Meta ruined LibGen for the rest of us who used it like a library. Just like how Google screwed over libraries by sending interns to the Stanford library to check out books they scanned into Google Books. Not to increase shared knowledge or preserve human artifacts, but to put them all in a museum and, to paraphrase Joni Mitchell, charge the people a dollar and a half just to see 'em.

ezekiel68, 3 months ago
Unless Meta 'fessed up to this (which seems unlikely), the headline here is missing the word "allegedly".

esarbe, 3 months ago
It's okay - they are a multi-billion company. Rules don't apply to them.

Rules are just for us peasants.

dansitu, 3 months ago
I'm fine with them using my books to train an open source model, but it would have been nice to be asked.

lewdev, 3 months ago
It's okay when large corporations download cars. But when you do it, you'll be in trouble.

iimaginary, 3 months ago
We need better laws that would create a better way to do this legally whilst compensating rights holders.

breppp, 3 months ago
Yes, it smells bad, but Facebook did the right thing (at least for Facebook).

After OpenAI trained their models on the famed *books2* dataset, and seeing the technological implications of ChatGPT, there was a good chance they would let them get away with it.

Would the USA really surrender its AI technological advantage for trivial matters like copyright? They would make some royalty arrangement and get it over with.

mrinterweb, 3 months ago
Remember people getting sued for insane amounts of money per song they torrented? If we applied that precedent to Meta, Meta would need to declare bankruptcy. https://www.cbsnews.com/news/file-sharing-mom-fined-19-million/

ofslidingfeet, 3 months ago
Yeah well, OpenAI compressed the whole internet into proprietary weights and is now providing access via paid subscription while the original internet gets deleted from our culture.

josefritzishere, 3 months ago
Zuckerberg did more copyright infringement? Shocking!

losvedir, 3 months ago
Hooray! Or wait, are we not doing that anymore?

waltercool, 3 months ago
Based. Free knowledge to the people

zelphirkalt, 3 months ago
Come on publishers! This is your chance! Now you can really show how you will treat all copyright infringements equally and not only go after easy targets. Show us how you spend all that money in a lawsuit against Meta!

bloopbloopscoop, 3 months ago
Death to intellectual property!

tremarley, 3 months ago
Ebooks are 1-2 MB each, max. 81.7 TB is a lot of books, like 42-85 million books.
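
The 42-85 million range above can be reproduced with a short Python sketch. It appears to correspond to binary units (TiB and MiB); with decimal units the range comes out slightly lower. Which convention the 81.7 TB figure actually uses is an assumption here, and `book_range` is a hypothetical helper written for this illustration.

```python
TB, MB = 10**12, 10**6        # decimal units
TIB, MIB = 2**40, 2**20       # binary units

def book_range(total_bytes, small_book, large_book):
    """Return (fewest, most) books, from the largest and smallest per-book size."""
    return total_bytes / large_book, total_bytes / small_book

# Same nominal figures, interpreted under the two unit conventions.
decimal = book_range(81.7 * TB, 1 * MB, 2 * MB)
binary = book_range(81.7 * TIB, 1 * MIB, 2 * MIB)

for label, (low, high) in (("decimal", decimal), ("binary", binary)):
    print(f"{label}: {low / 1e6:.1f} to {high / 1e6:.1f} million books")
```

Under either convention the order of magnitude is the same: tens of millions of books, if the whole dataset were ebooks of that size.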

Refusing23, 3 months ago
Their whole business is stealing data...

So it's quite funny to see they freely share it too.

ocean_moist, 3 months ago
At least they seeded!

snapcaster, 3 months ago
The powerful do what they can, the weak suffer what they must

jfbaro, 3 months ago
They are getting shittier and shittier

reverendsteveii, 3 months ago
So they're gonna go through every book that was stolen and apply the appropriate penalty, right? Each copyrighted work has a minimum penalty of $750 under the DMCA. That will be applied fairly in order to ensure that the rights holder is made whole by the infringer, right?

It's so funny to see the law blatantly ignored by the overlords. Like, there isn't even a pretext anymore. They just steal what they want and budget for the fines and campaign donations to make the consequences go away.

uncomplexity_, 3 months ago
Did they not seed enough, is that the crime? lol

Pxtl, 3 months ago
Laws are for poor people.

TZubiri, 3 months ago
I love it. This plotline feels like something out of Cryptonomicon or the Silicon Valley series.

hackerbeat, 3 months ago
One of the many reasons why Zuck's been sucking up to Trump. He's in desperate need of some Get-Out-Of-Jail-Free cards.

Same for all the other sleazy tech bros.

lazycog512, 3 months ago
Abolish knowledge rentiers

imgabe, 3 months ago
Boo hoo.

We are trying to advance civilization here. To accumulate and make available all human knowledge to date. And you stand there with your hand out to stop this? You are a villain. There is no sympathy for you.

palata, 3 months ago
Good, we know it. Nothing will happen, because nothing happens to billionaires and their companies. Musk is proving it every day now.

swozey, 3 months ago
I deleted my Facebook account about 10 years ago. Downloaded data, deleted. Not deactivated.

Nothing in my life made me ever want to go back until a few months ago, when I got back into playing hockey and found that all the hockey leagues use Facebook to communicate.

I made a new account, had to literally upload a picture of my face to pass verification... and then a few days later I was banned and couldn't use my account. I assume it's because they searched previous data, compared my face, and matched me to my "deleted" (lol) account. I've assumed they'll only let me log in if I use my original account, the one I deleted 10 years ago.

Fuck Meta. Fuck Zuck.

1970-01-01, 3 months ago
And they're going to get away with it simply because if you or I openly did this the DMCA fines would be for a million trillion dollars. Since Meta shareholders can't stomach a million trillion dollars in fines, their lawyers will wave their magic wands and poof! No laws were broken!

elzbardico, 3 months ago
Nothing is gonna happen. Just a slap on the wrist. And all of us in the intellectual working class, writers, journalists, programmers, will be proletarianized by LLMs that have been:

a) Financed via inflation/"Cantillon effect" due to ZIRP/stimulus that absolutely flooded the market with funny money in the hands of the sharks.
b) Trained on copyrighted work without compensation.
c) Trained on open source without even asking politely for authorization.

The Robber Barons of the last century can't even get close to our modern Feudal Tech Lords.

Unless you're one of us who has amassed multi-generation wealth in an exit in the last 20 years, you're completely fucked.