Microsoft's AI boss thinks it's perfectly OK to steal content if it's on open web

70 points by avivallssa 11 months ago

19 comments

jimmaswell 11 months ago

From the beginning, it's seemed completely intuitive to me that training a computer made of sand on publicly available content and then generating art later should be fair use, so long as it's fair use to train the meat computer in your head on the same content and then use it to generate art later. There's no meaningful difference to me as far as the ethics of the act are concerned.

jsyang00 11 months ago

No he doesn't.

> I think that with respect to content that’s already on the open web, the social contract of that content since the ‘90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been “freeware,” if you like, that’s been the understanding.

> There’s a separate category where a website, or a publisher, or a news organization had explicitly said ‘do not scrape or crawl me for any other reason than indexing me so that other people can find this content.’ That’s a grey area, and I think it’s going to work its way through the courts.

JonChesterfield 11 months ago

I'll bet they don't consider the Windows and Office source code fair game for arbitrary reuse provided the other party found the copy on the web. Even if the person found the copy on GitHub.

beefnugs 11 months ago

Isn't having this discussion at all stupidly letting them control the goalposts? They have already gone far beyond this, thinking that everything someone does on their own personal computer in their own home, without the slightest bit of consent, is going to be slurped up and recorded in case they want to query it someday.

This is like arguing that this guy who just murdered someone 10 minutes ago should actually be able to steal the candy from this child, since the child put it down on the park bench.

starik36 11 months ago

The more I read about this guy, the more I get the feeling that he is an unscrupulous individual.

robots.txt is a "grey idea" to him, instead of being a directive to keep moving? Wow.

mewpmewp2 11 months ago

What exactly is wrong with the statement he has made?

1vuio0pswjnm7 11 months ago

He compares fear of "AI" to fear of calculators. But "AI" cannot do math. Calculators do not "hallucinate". They are not correct "80%" of the time. They are correct 100% of the time. We know how they work. IIRC, in the 1970s someone at Bell Labs wrote a UNIX program that could generate fake academic papers. It might be a fun gag, but does it have much practical utility? No matter how "real" the papers might appear, or even if they are correct "80%" of the time, it is not an "invention", and it is certainly not comparable to a calculator.

avivallssa 11 months ago

Will this make people who earn money indirectly through their content less motivated to publish it on the Web? This might be arguable.

Maybe there should be a similar amount of openness in publishing the content used for training commercial models.

The copyright owner should have the privilege of asking for that content to be removed from training. This may also allow individual authors to gain their share with their advanced RAG applications that are specially focused on the content they own and have also published on the web.

29athrowaway 11 months ago

One thing is a robots.txt policy, meant mostly for search crawlers.

Another thing is the copyright of the content, terms of use policies, etc.

Abiding by a robots.txt policy doesn't make you immune to copyright, terms of service, law in various jurisdictions, etc. If you think that, you are probably a kleptomaniac.

Just create a robots.txt with "User-Agent: one billion asterisks" so that the crawlers die when parsing it.

sircastor 11 months ago

It seems obvious to me that there is no such thing as AI without publicly training on the open web, and that any kind of licensing is an impossible feat.

Programs from my youth (Daria, Captain N) had licensed music for their broadcast, and that’s all because what else was ever going to be done? 20 years later, streaming with the music intact is an impossibility because the kind of money necessary to license all of it was too much. And you have to make deals with dozens of companies.

Multiply that by several orders of magnitude and you start to see the scope of the problem.

sircastor 11 months ago

Part of the problem here is that the web has gone through lots of change as to what it is and how people understand it.

Some people think of it as billboards posted on the highway. Some think it’s a bulletin board. Some think it’s a newspaper. A television, a “zine”, a diary, graffiti. It has been all of these things, and is and isn’t. And people who publish are really bad at explicitly stating which one they are. But they expect you to know.

fimdomeio 11 months ago

So we've now learned that copyright is determined by communications protocol. If you're using torrents it's copyright infringement, if it's the web then it's public domain.

boring-alterego 11 months ago

Hmm, hear me out: go to a public website and add black space below any video or picture, with random adjectives that are your satire review of that piece of art, then feed those into the AI model and tell it to ignore any text.

KoolKat23 11 months ago

This is nothing but performative clickbait by the Verge.

It is classified as fair use; the term is transformative use, where those using it are training models (their intention), if anyone wishes to Google it.

The end.

whacko_quacko 11 months ago

Scraping the open web shouldn't be a crime[1], even if unsavoury people do it for unsavoury purposes.

[1]: or even just an issue

byyll 11 months ago

It's not stealing content if the content is still in the original place. Stop trying to redefine words. It's copying.

93po 11 months ago

If buying isn't owning, copying isn't stealing. This is a really tired argument.

cjk2 11 months ago

Ah yes, the implied social contract that it's OK because it happens all the time.

That's how society falls.

tiahura 11 months ago

The open web's ethos since its inception in the 1990s has been one of unrestricted access and fair use. Content published openly online inherently invites broad consumption, reproduction, and creative reuse by the public. This is not merely custom, but a fundamental aspect of fair use doctrine as applied to the digital realm.

The four factors of fair use (purpose of use, nature of the copyrighted work, amount used, and effect on the market) overwhelmingly favor allowing free use of openly published web content. The transformative nature of most reuses, the public availability of the original works, the necessity of using entire works in many cases, and the lack of a traditional market for such content all support this interpretation.

This longstanding practice has been the catalyst for unprecedented innovation and information dissemination. It represents a tacit social contract between content creators and users, establishing a de facto "freeware" model for open web content. Any attempt to retroactively impose strict copyright limitations would not only stifle innovation but also contradict decades of established legal precedent and digital norms.

As a side note, I’m not certain that training necessarily involves “copying.”

Lastly, if anyone really thinks the Roberts Court is going to knee-cap AI, you’re soft in the head.
