If you open-sourced code and allowed it to be used for commercial purposes, I don't see the point of being pissy about GitHub using it. I'm saying this as someone who's written quite a lot of MIT-licensed code.<p>(And charging for a product which adds value to your developer experience and needs money to run is not a bad thing.)
So... I can see that this ML model is generating some code exactly the same as in the original dataset, which is definitely a problem. A defective model, sure.
Besides that, I cannot understand why the overall idea, using open-source projects to train an ML model that generates code, would ever be a problem. We human beings learn the same way the model does: we read other people's code, books, articles, design patterns... and it becomes part of us. Even private code, I mean, when you join a company, you read their codebase and methodology and it becomes something of yours. Copyright generally does not allow you to "copy" the original, but you can still synthesize your own code: cutting, combining, creating based on whatever you have learnt.
The way an ML model works differs from the human brain, for sure, but I cannot see why this would be a problem, or why an organic brain should be considered so superior that what it does is creation while what an ML model does is scraping your code. What is the difference here?<p>And also, recently we saw GPT generating articles and Waifu Labs generating... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
I have a genuine question about this whole thing with Copilot:<p>A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it's trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It's a really cool product, by the way, and I've been using and loving it for years.) According to their website they have over 1M active users.<p>Why is this suddenly such a big deal, and why is everyone suddenly freaking out about Copilot? Is it because it's GitHub and Microsoft and OpenAI behind Copilot vs. some small startup you've never heard of? Is it just that the people freaking out weren't paying attention and didn't realize this service already existed?
Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped those projects then it's a problem.<p>Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:<p>If the code was not encumbered by restrictions, I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta-analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.<p>If I scraped GitHub code and then wrote a book about common coding patterns & practices, I don't think that would be wrong.<p>I used the Brown corpus and multiple other written-word corpora, along with WordNet and other sources, to write my Computational Linguistics thesis on word sense disambiguation, later applying it to my job, which earns me money. Is this wrong?<p>Public datasets have been used extensively for ML already. I don't see this as much different.
> Hi. I know you’re excited about copilot.<p>> ...<p>> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.<p>Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
Here's the brutal and ugly truth: why isn't our personal data treated as private property? It's because those who write the laws governing its status either lack the requisite understanding or else practice a form of, to put it mildly, motivated reasoning.
Well, for the most part my code isn't going to do anyone a ton of good. I don't use much in the way of popular frameworks, but I also guess this means I'm gonna be out of a job for not writing "normal" enough code at some point.<p>Time to move on to the carbon age I suppose.
Is there no licence with any sort of model training clause: "If this licence or the source code it covers is used to train a statistical model, then the model and code used to create the model are covered by this licence (which has terms like the AGPL)"?<p>If not, will anybody quietly slip something like this into Copilot's training data?
I'm a big proponent of open source and I'm usually not kind about GitHub's bad moves. For example, I find it stupid to use VS Code and believe that it is open source when that is a lie.<p>But in this case, I think the accusations leveled at GitHub are not right.<p>I think the idea is nice and a fair use of open-source code. Anyone is free to download free software and do something similar with it, and that is fine.<p>I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back, claiming they expected the code generated by Copilot to be safe for commercial use.<p>Also, I would consider it abuse if GitHub had used private-repo code to train their model without permission.
What's hilarious about auto-generating the GPL license text is that it proves Copilot was trained on GPL code, but it's essentially impossible to tell which code it came from. Any legal battle will be strange... Is it enough for Copilot not to regurgitate GPL-licensed code exactly? Is it enough for Copilot to create a slightly modified version?
Laughably, as soon as slight variation is added, there is so much code in the world that it'll be impossible to prove wrongdoing for HTML or JavaScript synthesis. A model trained on all permissively licensed code on GitHub produces something that looks a lot like your own GPL code? Are you sure your code is that unique?<p>Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
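Just to make the point concrete (a rough Python sketch of my own, nothing to do with how Copilot actually works): once you normalize identifiers, literals and whitespace, two "different" snippets often collapse to the exact same token shape, which is why proving copying gets murky so quickly.

    import io
    import tokenize

    def normalized_tokens(source):
        """Replace identifiers, numbers and strings with placeholders; drop formatting."""
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                out.append("ID")
            elif tok.type == tokenize.NUMBER:
                out.append("NUM")
            elif tok.type == tokenize.STRING:
                out.append("STR")
            elif tok.type == tokenize.OP:
                out.append(tok.string)
        return out

    a = "def area(radius): return 3.14159 * radius * radius\n"
    b = "def surface(r):\n    return 3.14159 * r * r\n"
    print(normalized_tokens(a) == normalized_tokens(b))  # True: same shape, different text

Once "copies" are only equal after that kind of normalization, whose code was copied becomes genuinely hard to argue in court.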
If you're hosting at the free GitHub service, or even the paid one, GitHub did not scrape your code. They just accessed information on hardware they own; HTTP wouldn't have to be involved at all. They could just look at the disks.<p>Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""<p>The above isn't to say I agree with this, but just to highlight the dangers of outsourcing and the cloud.
See some analysis of the scope of this issue here:
<a href="https://docs.github.com/en/github/copilot/research-recitation" rel="nofollow">https://docs.github.com/en/github/copilot/research-recitatio...</a><p>especially: Conclusion and Next Steps.<p>This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.<p>But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.<p>The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.<p>This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
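For what it's worth, the "duplication search" they describe doesn't have to be magic. Here's a minimal sketch of the idea (my own toy version in Python, with a made-up training_corpus, not GitHub's actual implementation): hash overlapping token windows of the training set, then look suggestions up in that index.

    import hashlib

    WINDOW = 5  # token-window size; a real system would use much larger windows

    def window_hashes(tokens, window=WINDOW):
        """Hash every run of `window` consecutive tokens."""
        for i in range(len(tokens) - window + 1):
            chunk = " ".join(tokens[i:i + window])
            yield hashlib.sha1(chunk.encode()).hexdigest()

    # Toy "training set": file path -> token list (stand-in for the real corpus)
    training_corpus = {
        "quake/q_math.c": "i = 0x5f3759df - ( i >> 1 ) ;".split(),
        "hello/greet.py": "print ( 'hello world' )".split(),
    }

    # Index every token window back to the file it came from
    index = {}
    for path, tokens in training_corpus.items():
        for h in window_hashes(tokens):
            index.setdefault(h, path)

    def attribution_check(suggestion_tokens):
        """Return the training files a suggestion overlaps with, if any."""
        return {index[h] for h in window_hashes(suggestion_tokens) if h in index}

    print(attribution_check("i = 0x5f3759df - ( i >> 1 ) ;".split()))  # {'quake/q_math.c'}

If the lookup comes back non-empty, the UI can surface those paths so you can attribute properly or skip the suggestion entirely, which is exactly what they say they plan to add.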
Here is the relevant portion of GitHub's terms of service (section D.4) [0]:<p>"""<p>4. License Grant to Us<p>We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.<p>...<p>"""<p>Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.<p>[0] <a href="https://docs.github.com/en/github/site-policy/github-terms-of-service#4-license-grant-to-us" rel="nofollow">https://docs.github.com/en/github/site-policy/github-terms-o...</a>
To me it seems the whole subject requires additional consideration in licensing. It is a little like applying telephone-era law to the internet: it will not fit 100%.<p>If the creator's interests are no longer clearly expressed by a license, we need updates to the license texts.<p>Let's look at MIT:<p>____________________<p>"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:<p>The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
[...]<p>____________________<p>From the license text alone, it would not be clear to me why anyone could claim that the OpenAI Codex or GitHub Copilot would require attribution to any of the MIT-licensed source code used to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof; it is essentially a mathematical/statistical analysis of it.<p>Now what about any generated new source? How similar does it need to be to some source to count as a copy? At what size does the generated code qualify as a copy instead of a snippet of industry best practice?<p>Where does the responsibility for attribution lie? Should we treat AI code generation models like a copy-and-paste program? Usually you cannot really say where a copy came from with 100% certainty, so how do you know what factors influenced it?
Newsflash: open source already means that you're doing free work for the largest corporations in the world! It seems like developers, as a group, decided that it would be better to spend their nights writing free code for FAANG so they could keep their day jobs. Bezos and friends thank you all. #genius
SourceHut[0] is getting more attractive with each passing day, but I'm not sure I can adapt to its weird email-centric pull requests (and I know that this is a standard Git feature, but the UX seems bad).<p>[0] <a href="https://sourcehut.org/" rel="nofollow">https://sourcehut.org/</a>
A colleague of mine: "I either remove all my (useless) repositories from GitHub or I ask GitHub to pay me if they want to use my code in Copilot".<p>It's not that crazy.
> It’s truly disappointing to watch people cheer at having their work and time exploited<p>Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will drag on like Tesla's autopilot has for years.
I don’t understand this mentality. The AI is trained (or at least supposed to be - that’s fixable) on code that was published under open licenses. The “exploited by the man” trope after publishing OSS feels entirely backwards.
Nothing is free, people! ... People are outraged at GitHub, but nobody is going after Facebook or Google for training their AIs on your personal data. Facebook used your face to train some algorithms, Google your personal emails, etc.
There may be discussions to be made about licenses, but "to watch people cheer at having their work and time exploited by a company worth billions" is a disappointingly myopic take, especially from a developer.<p>Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.<p>We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
On their website they say that "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set."<p>So it won't copy-paste your code. It has just read code from open sources and learned from it, similar to what humans do. So I don't see any problem with this.