If you open-sourced code and allowed it to be used for commercial purposes, I don't see the point of being pissy about GitHub using it. I'm saying this as someone who's written quite a lot of MIT-licensed code.<p>(And charging for a product which adds value to your developer experience and needs money to run is not a bad thing.)
So... I can see that this ML model is generating some code exactly the same as in the original dataset, which is definitely a problem. A defective model, sure.
Besides that, I cannot understand why the overall idea, using open-source projects to train an ML model that generates code, would ever be a problem. We human beings learn the same way the model does: we read other people's code, books, articles, design patterns... and it becomes part of us. Even private code, I mean, when you join a company, you read their codebase and methodology and it becomes something of yours. Copyright generally does not allow you to "copy" the original, but you can still synthesize your own code: cutting, combining, creating based on whatever you have learnt.
The way an ML model works differs from the human brain, for sure, but I cannot see why this would be a problem, or why an organic brain should be considered so superior that what it does is creation while what an ML model does is scraping your code. What is the difference here?<p>And also, recently we saw GPT generating articles and Waifu Labs generating... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
I have a genuine question about this whole thing with Copilot:<p>A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it's trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It's a really cool product, by the way, and I've been using and loving it for years.) According to their website they have over 1M active users.<p>Why is this suddenly such a big deal, and why is everyone suddenly freaking out about Copilot? Is it because it's GitHub and Microsoft and OpenAI behind Copilot vs. some small startup you've never heard of? Is it just that the people freaking out weren't paying attention and didn't realize this service already existed?
Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped those projects then it's a problem.<p>Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:<p>If the code was not encumbered by restrictions, I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta-analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.<p>If I scraped GitHub code and then wrote a book about common coding patterns & practices, I don't think that would be wrong.<p>I used the Brown corpus and multiple other written-word corpora, along with WordNet and other sources, to write my Computational Linguistics thesis on word sense disambiguation, later applying it to my job, which earns me money. Is this wrong?<p>Public datasets have been used extensively for ML already. I don't see this as much different.
> Hi. I know you’re excited about copilot.<p>> ...<p>> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.<p>Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
Here's the brutal and ugly truth: why isn't our personal data treated as private property? It's because those who write the laws governing its status either lack the requisite understanding or else practice a form of, to put it mildly, motivated reasoning.
Well, for the most part my code isn't going to do anyone a ton of good. I don't use much in the way of popular frameworks, but I also guess this means I'm gonna be out of a job for not writing "normal" enough code at some point.<p>Time to move on to the carbon age I suppose.
Is there no licence with any sort of model training clause: "If this licence or the source code it covers is used to train a statistical model, then the model and code used to create the model are covered by this licence (which has terms like the AGPL)"?<p>If not, will anybody quietly slip something like this into Copilot's training data?
I'm a big proponent of open source and I'm usually not kind about GitHub's bad moves. For example, I find it stupid to use VS Code and believe that it is open source when that is a lie.<p>But in this case, I think the accusations leveled at GitHub are not right.<p>I think the idea is nice and a fair use of open-source code. Anyone is free to download free software and do something similar with it, and that is fine.<p>I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back, claiming they expected the code generated by Copilot to be safe for commercial use.<p>Also, I would consider it abuse if GitHub had used private-repo code to train their model without permission.
What's hilarious about auto-generating the GPL license text is that it proves Copilot was trained on GPL code, but it's essentially impossible to tell which code it came from. Any legal battle will be strange... Is it enough for Copilot not to regurgitate GPL-licensed code exactly? Is it enough for Copilot to create a slightly modified version?
Laughably, as soon as slight variation is added, there is so much code in the world that it'll be impossible to prove wrongdoing for HTML or JavaScript synthesis. A model trained on all permissively licensed code on GitHub produces something that looks a lot like your own GPL code? Are you sure your code is that unique?<p>Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
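Just to make the point concrete (a rough Python sketch of my own, nothing to do with how Copilot actually works): once you normalize identifiers, literals and whitespace, two "different" snippets often collapse to the exact same token shape, which is why proving copying gets murky so quickly.

    import io
    import tokenize

    def normalized_tokens(source):
        """Replace identifiers, numbers and strings with placeholders; drop formatting."""
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                out.append("ID")
            elif tok.type == tokenize.NUMBER:
                out.append("NUM")
            elif tok.type == tokenize.STRING:
                out.append("STR")
            elif tok.type == tokenize.OP:
                out.append(tok.string)
        return out

    a = "def area(radius): return 3.14159 * radius * radius\n"
    b = "def surface(r):\n    return 3.14159 * r * r\n"
    print(normalized_tokens(a) == normalized_tokens(b))  # True: same shape, different text

Once "copies" are only equal after that kind of normalization, whose code was copied becomes genuinely hard to argue in court.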
If you're hosting at the free GitHub service, or even the paid one, GitHub did not scrape your code. They just accessed information on hardware they own; HTTP wouldn't have to be involved at all. They could just look at the disks.<p>Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""<p>The above isn't to say I agree with this, but just to highlight the dangers of outsourcing and the cloud.
See some analysis of the scope of this issue here:
<a href="https://docs.github.com/en/github/copilot/research-recitation" rel="nofollow">https://docs.github.com/en/github/copilot/research-recitatio...</a><p>especially: Conclusion and Next Steps.<p>This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.<p>But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.<p>The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.<p>This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
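For what it's worth, the "duplication search" they describe doesn't have to be magic. Here's a minimal sketch of the idea (my own toy version in Python, with a made-up training_corpus, not GitHub's actual implementation): hash overlapping token windows of the training set, then look suggestions up in that index.

    import hashlib

    WINDOW = 5  # token-window size; a real system would use much larger windows

    def window_hashes(tokens, window=WINDOW):
        """Hash every run of `window` consecutive tokens."""
        for i in range(len(tokens) - window + 1):
            chunk = " ".join(tokens[i:i + window])
            yield hashlib.sha1(chunk.encode()).hexdigest()

    # Toy "training set": file path -> token list (stand-in for the real corpus)
    training_corpus = {
        "quake/q_math.c": "i = 0x5f3759df - ( i >> 1 ) ;".split(),
        "hello/greet.py": "print ( 'hello world' )".split(),
    }

    # Index every token window back to the file it came from
    index = {}
    for path, tokens in training_corpus.items():
        for h in window_hashes(tokens):
            index.setdefault(h, path)

    def attribution_check(suggestion_tokens):
        """Return the training files a suggestion overlaps with, if any."""
        return {index[h] for h in window_hashes(suggestion_tokens) if h in index}

    print(attribution_check("i = 0x5f3759df - ( i >> 1 ) ;".split()))  # {'quake/q_math.c'}

If the lookup comes back non-empty, the UI can surface those paths so you can attribute properly or skip the suggestion entirely, which is exactly what they say they plan to add.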
Here is the relevant portion of GitHub's terms of service (section D.4) [0]:<p>"""<p>4. License Grant to Us<p>We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.<p>...<p>"""<p>Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.<p>[0] <a href="https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us" rel="nofollow">https://docs.github.com/en/github/site-policy/github-terms-o...</a>
To me it seems the whole subject requires additional consideration in licensing. It is a little like applying telephone-era law to the internet: it will not fit 100%.<p>If the creator's interests are no longer clearly expressed by a license, we need updates to the license texts.<p>Let's look at MIT:<p>____________________<p>"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:<p>The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
[...]<p>____________________<p>From the license text alone, it would not be clear to me why anyone could claim that the OpenAI Codex or GitHub Copilot would require attribution to any of the MIT-licensed source code used to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof; it is essentially a mathematical/statistical analysis of it.<p>Now what about any generated new source? How similar does it need to be to some source to count as a copy? At what size does the generated code qualify as a copy instead of a snippet of industry best practice?<p>Where does the responsibility for attribution lie? Should we treat AI code generation models like a copy-and-paste program? Usually you cannot really say where a copy came from with 100% certainty, so how do you know what factors influenced it?
Newsflash: open source already means that you're doing free work for the largest corporations in the world! It seems like developers, as a group, decided that it would be better to spend their nights writing free code for FAANG so they could keep their day jobs. Bezos and friends thank you all. #genius
SourceHut[0] is getting more attractive with each passing day, but I'm not sure I can adapt to its weird email-centric pull requests (and I know that this is a standard Git feature, but the UX seems bad).<p>[0] <a href="https://sourcehut.org/" rel="nofollow">https://sourcehut.org/</a>
A colleague of mine: "I either remove all my (useless) repositories from GitHub or I ask GitHub to pay me if they want to use my code in Copilot".<p>It's not that crazy.
> It’s truly disappointing to watch people cheer at having their work and time exploited<p>Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will drag on like Tesla's autopilot has for years.
I don’t understand this mentality. The AI is trained (or at least supposed to be - that’s fixable) on code that was published under open licenses. The “exploited by the man” trope after publishing OSS feels entirely backwards.
Nothing is free, people! ... People are outraged at GitHub, but nobody is going after Facebook or Google for training their AIs on your personal data. Facebook used your face to train some algorithms, Google your personal emails, etc.
There may be discussions to be made about licenses, but "to watch people cheer at having their work and time exploited by a company worth billions" is a disappointingly myopic take, especially from a developer.<p>Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.<p>We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
On their website they say that "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set."<p>So it won't copy-paste your code. It has just read code from open sources and learned from it, similar to what humans do. So I don't see any problem with this.