If it were possible to point to where the code is actually stored in Copilot (i.e. run strings on the server and it spits out the copylefted code), I would be a lot more sympathetic to the view that LLMs are stealing code. Even if you decode the weights from floats to strings, you won't find the strings of code stored anywhere. It's probably not correct to say it's "learning" from code the same way humans are, but it <i>is</i> "learning" from code in the way LLMs learn.<p>Fundamentally it seems to me that Copilot truly does synthesise code from some store of knowledge (even if it's hard to understand what this store of knowledge is), and the problem is that it's synthesising code identical to existing code. There are legal tools and also rhetoric designed for dealing with this problem of "synthesising something that is identical to an existing thing", and they are <i>different</i> tools and rhetoric from the ones we have for dealing with the problem of "stealing or copying existing things". It's valid to have an issue with Copilot ingesting your code, but unfortunately people are largely using tools and rhetoric from the latter category to approach the issue, and that slight misapplication is causing their concerns to fall on deaf ears a lot of the time.
I'm really baffled by all this discussion of copyright in the age of AI. Copilot does not 'steal' and reproduce our code - it simply LEARNS from it, as a human coder would learn from it. IMHO the desire to prevent learning from your open-source code seems kind of irrational and antithetical to open-source ideas.
> It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.<p>Both claims here are incorrect even though pretty much everyone gets this wrong. When someone who has obtained the right to redistribute some code only under the GPL uploads it to GitHub, that person (and that person only) violates the terms of the GPL. The GPL requires further redistribution only under its own terms, but uploads to GitHub come with a grant of a too-permissive license that a GPL licensee does not have the right to grant.<p>When GitHub proceeds to use the uploaded code to train copilot, they (probably) are abiding by the terms of this new license they have been (fraudulently) granted. They are not bound by the GPL, that's not how licenses work: they've got the other one. Now, GitHub has a big weakness here which is that they ought to know they're being granted licenses that the putative licensors have no right to grant. But that still would not make them in violation of the GPL, just of the original copyright.
Let's leave copilot/AI aside for a moment.<p>Do you actually have the necessary rights to upload someone else's code to github?<p>When you upload to github, you give it special rights, not merely for redistribution and CI stuff.<p>You give Github the rights to use the code for other github projects. That alone might not be compatible with some licenses (think GPL virality).<p>So if your software is BSD-licensed or anything without an attribution requirement, I can probably upload it without problems.<p>But if your license requires so much as attribution, can I give Github the rights to use the code for any other internal project they might have?<p>Remember that in the Github TOS you give grants for <i>any</i> github service, and in some special cases it requires this to be without attribution.<p>IANAL but I think I lack the necessary rights here, regardless of copilot.
I understand the sentiment, but I think it is misguided and therefore - counterproductive.<p>First, LLMs learn patterns; they don't just copy and paste. If they generate verbatim copies of any part non-trivial enough to be subject to copyright, yes, that would be copyright infringement. Yet, could anyone give practical examples of this? And if so, how do they differ from a software engineer who copies and pastes code?<p>Second, if the code is hosted anywhere else, there is no guarantee that Copilot (or another model) won't learn from it. The only way to make sure no one and nothing will learn from open-source code is to make it as closed as possible.<p>Third, for me, the crucial part of open-source code is maintenance. GitHub is there and works well both as a platform for creation (I consider GitHub the most productive social network) and as an archive. "No GitHub" (even as a mirror) means that the code is likely to be stored in places less likely to engage collaborators and less likely to last long.
I get the sentiment, but people can and should do whatever you permit them to in your license. If you don't want your code hosted in one place, say so in your license.
Due to their TOS I believe it's illegal to upload any code you don't have full rights to to GitHub (if it wasn't already uploaded by someone who has full rights to it, which isn't the same as it already being on github).<p>This is even true for e.g. MIT licenses, as while they are very permissive they still require a form of attribution, which Copilot doesn't provide.<p>Also, for anyone arguing that such ML models "learn": at least when they are not exactly perfectly sized, or sized below that, they are basically guaranteed to verbatim-encode partial copies of code; it's a fundamental consequence of their design. And while this encoding is "encoded" in some way, it's not transformative in the definition of copyright law AFAIK. I.e. any big model is guaranteed to commit hard-to-trace copyright infringement.
How has it come to that? The tech industry has declared in turn that privacy is dead, that labor contracts are dead, that tax obligations are dead and now that copyright is dead. Why is the most promising tool towards a better society appropriated in such a destructive manner?<p>People must separate their fascination with tech as such, from the predominant tech business models that are basically - you can't put lipstick on a pig - <i>parasitic</i>.
The problem with this approach is that humans suck, and this is an invitation for trolls to upload your code to github. I don't even know how you're supposed to solve this.<p>You could proactively scan github for your code and try to get them to purge it if you find it, I suppose, if that would remove your code from copilot. But even that is not a great solution, because you would need to prove you're the actual author, and github would probably need to be involved in building a mechanism to do so, but they don't give a shit.<p>I think the reality is that LLMs have eroded copyright protections and trying to fight it isn't likely to pay off
I don't understand why people freak out about their code being used by others when they publicly release it. Is it lack of attribution? Does attribution really do anything if you're not already famous? If you don't want your code known, don't make it open source.
It's ironic that the movie studios want license agreements for content they own to cover distinct uses (streaming vs DVD's for example), yet they expect actors to agree to allow their voice, likeness, mannerisms, etc. to be usable by an AI for future projects (only the studio's own projects of course!).<p>If I own "the thing", I want you to pay for each new, distinct kind of use for it. But if it's your thing I want to use, I want you to have a permissive license.
The ship has sailed and the cat is out of the bag. By the time any of this shakes out everyone will be using copilot and the lawyers will have ground any argument into dust. Good luck putting it back in the bag.<p>After you have cleaned everything off of GitHub and separated yourself from the ecosystem it (AI bots) will be everywhere and your code gobbled up again.<p>The only thing that is worth a fuck is a working product. Not your boilerplate or fancy snippets that you want to claim ownership over so as to stop the evil Microsoft from benefitting the community.<p>Are we devs or luddites?
The GPL and AGPL allow adding further terms to the license as long as they don't clash with any existing ones (though note that under GPLv3 section 7, licensees may remove most non-permissive added terms). The FSF should certainly share some guidelines on how we can extend the license with an additional "anti-ML / anti-AI" clause.
This is misguided. What you want is adding a special clause to your license that disallows usage for training LLMs. Whether the code is on GitHub or not, it’ll be used to train models if it’s publicly available and the license allows it.
Ah yes. More "Free" as in "Do As You're Told" from people who call their source more open than anyone else's. Coming up on a decade later, and if anything we have more to learn: <a href="https://marktarver.com/free-as-in-do-as-your-told.html" rel="nofollow">https://marktarver.com/free-as-in-do-as-your-told.html</a>
I think of large models, ones such as CoPilot, as lossy compression with content addressable retrieval. If you type the first few parts of some content it has stored then it will retrieve the rest for you.<p>The blocks retrieved are very small and many of them occur frequently. “if” followed by “(“ for example — hardly worthy of copyright, but we also know that they were literally taken from copyrighted material.<p>(I don’t think the model starts out with any existing knowledge of syntax / grammar of, say, Python?)<p>Even if some of that material was public domain, a lot of it wasn’t and at best requires attribution; at worst, full licensing conditions.<p>To put it another way: it doesn’t matter how many vegetable ingredients they throw into the sausage or how elaborate the sausage making machine is: if they put pork in, the stuff that comes out ain’t vegetarian.
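The "lossy compression with content-addressable retrieval" analogy above can be sketched with a toy example (names hypothetical; a real LLM stores statistics over tokens rather than literal strings, so this only illustrates the prefix-completes-to-content behaviour, not the mechanism):

```python
# Toy sketch of content-addressable retrieval: store some snippets,
# then, given the first few characters of one, retrieve the rest.
snippets = [
    "for i in range(len(items)):",
    "if err != nil { return err }",
]

def complete(prefix: str):
    """Return the continuation of the first stored snippet matching prefix."""
    for s in snippets:
        if s.startswith(prefix):
            return s[len(prefix):]
    return None  # nothing stored under this "address"

print(complete("if err"))  # → " != nil { return err }"
```

The point of the analogy is that even though the stored form looks nothing like the original (weights instead of a list of strings), the retrieval behaviour can still surface the original content.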
> But then, Copilot will be able to analyze the code and violates the license terms, which isn't.<p>I'm not sure about this. In most cases the result would be similar to a human reading a lot of open source and later, when writing, using patterns they'd learned. It's only in the edge cases, where there's clear 'plagiarism' on a niche prompt, that it would be problematic. A more direct solution isn't to take everything off GitHub, but rather to not allow Copilot to do near-literal copy/paste.<p>If we moved open source to BitBucket, there's no protection that it wouldn't do the same as Copilot. Attack the problem directly.<p>A way to think of this banner is as signing a publicly visible petition to make Copilot behave as humans abiding by licenses do.
> Even if a project is not hosted on GitHub, other people have the legal right (depending on the license) to redistribute the source code. It means that they have the right to share the code of others on GitHub, as long as they respect the terms of license. This is totally legal. But then, Copilot will be able to analyze the code and violates the license terms, which isn’t.<p>While encouraging people to not distribute code via Github may mitigate the issue some, the actual issue is how Github has mass-automated the process of violating open source licenses. Github should pay a fine for every suggestion Copilot produces that violates a software license, plain and simple. Don't blame the people that unknowingly upload code to the training dataset.
The business model for disruptive "innovation":<p>Weasel wording to evade legislation. It's not an unlicensed taxi, it's ride sharing. It's not an illegal hotel, it's couch surfing. It's not code licence infringement because it's learning.<p>It's lawyers finding loopholes for finance to avoid expenses that gives them an edge over the poor suckers who play by the rules. The tech is just a tool to this purpose.<p>And most people forget TANSTAAFL. The costs for the cars and their infrastructure, the load tourists put on a place, and the effort for writing the code are still there. The "innovators" just found a way to make somebody else pay for what should be their cost center.
Why is HN so defensive against Copilot yet has a completely different opinion about MidJourney / StableDiffusion? Both are generative software, and when either generates a piece verbatim it only means overfitting / overtraining on that particular example.<p>The tone on one of these tools is hypocritical. When it comes to digital art, the general sentiment is that it's inevitable and artists need to up their game. This sentiment is not being repeated for code generators.
While I sympathise with the remarks, I think these arguments are wrong.
In any programming language there are only so many ways to write something. For instance, iterating over a list of items and doing some processing on them.
So when someone publishes their code under GPL, suddenly millions of projects that come up with exactly the same code can be in a violation?
I get this is complex, but copyrighting code is akin to copyrighting maths formulas.
Imagine if a company selling bottled water could also copyright water. That's very much what is happening.
It sucks that people will be out of work once AI can do the same work basically for free. For years I have felt that with more tax and a sort of “freedom fund”, more people would be able to make “free” stuff in public for the global good. The only way to get there is to stop paying for stuff, such that we eventually reach a utopian future as seen in Star Trek. Although the intervening period would be chaotic. For example, we pay tolls on some roads but we don’t pay cash every time we want to walk down the street - that idea sounds crazy. Likewise, if we could support the work of creators with UBI then we could all enjoy free software, media, books, etc. I know I would keep creating, because I love it. But for now the middleman wants his share, so it seems sensible that we must pay companies for their produce directly.<p>To be even more extreme, a _lot_ of people think that if anything is public, it is fair game to consume and reuse. If you didn’t have to worry about getting paid, would you care about someone enjoying and using your work? Perhaps the opposite - that would be your motivation.
One relatively simple answer is to just add to the license a term saying that neither it nor its derivative works may be uploaded to Github or used to train any AI. It will be widely breached because no one cares about the license, but it does at least give you some legal recourse if you choose to pursue it. You aren't required to use just GPL v2-3; you can amend the terms however you wish.
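For illustration only — this is hypothetical wording, not legal advice, and note that a field-of-use restriction like this would arguably make the license no longer open source under the OSI definition — such an added term might look something like:

```text
Additional Term: Neither this software nor derivative works of it may be
uploaded to GitHub or any affiliated service, or used, in whole or in
part, as training data for any machine learning system.
```

Whether such a term is enforceable, and whether it can validly be combined with a particular base license, is exactly the kind of question the lawyers would need to settle.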
Isn't e.g. ChatGPT trained on data from all over the internet? That means even if you upload your code to GitLab or your own public Gitea instance, it will still be used to train AIs?!<p>I don't see the point of this because, quite frankly, if you want to prevent AIs from using your data, you already lost that battle the moment you uploaded your code to the publicly accessible internet.
Where does OpenAI state that they only train ChatGPT-3.5 or GPT-4 on code from GitHub? The model for GitHub Copilot X clearly has a (human) language understanding that you can't get from source code (or source code comments), so they are trained on much more data than GitHub has and there is no reason to believe OpenAI would limit themselves to that.
Please upload all open source code to everywhere that will accept it. Make everything indestructible. Make it as easy as possible to find an authentic copy. Stick every bit of code with a license that allows it in every free <i>and</i> proprietary neural network.<p>Did you write the absolute best implementation of X? I want to see it everywhere. Everywhere. In every single place where X is needed or discussed. Where all fine X are sold. Don't you? Or do you genuinely want to narrow the people who see your impl by some significant percentage because you got frustrated with the ugly capitalism of a recent distribution mechanism?<p>If I invented Golden rice[0] I'd like to think I'd allow it to be sold at Panda Express and all the other evil capitalist chains, not just the local co-op whose business practices I prefer.<p>0. <a href="https://en.wikipedia.org/wiki/Golden_rice" rel="nofollow">https://en.wikipedia.org/wiki/Golden_rice</a>
Copilot and other code-authoring LLMs are one of the biggest innovations in software engineering in recent years. I use it daily and can't imagine going back to work without it — adopting it in 2021 was the same shock of a productivity boost as when I learned vim.<p>Yes, I know that it can occasionally break a license by producing licensed code verbatim. But in my almost 2 years of using it daily I have never seen it happen first-hand, and I don't see how this licence infringement could actually do any significant damage to anyone — so while I acknowledge that this problem exists, I refuse to accept that it's as significant as people make it out to be.<p>For a long time, it seemed like technical progress in computing had stopped, and now that AIs and LLMs are finally bringing exciting new technology to life, it's very sad to see exactly the people who should be excited and inspired by it — software engineers — fighting against it.
As far as I know, rights and licenses don't work with suggestions like "please". Either embed it in your COPYING/LICENSE files so the initial 'uploading to github' action would be illegal, or take legal action against the main offender (which, in this case, is GitHub).
Of course this isn't just a problem with copyleft licenses on github, but also with non-open code on github. Only there the problem may be less visible.<p>Ideally, github should check the license of the code it's using to feed copilot, and only use code with permissive licenses.
What is stopping Github from crawling whatever other open source platform you choose?<p>I don't think this will fix anything.<p>Actually, developers are the only ones who stand to lose here, since open source will now be spread across multiple platforms, making it harder to find what you want
Don't share and license your code permissively then.<p>People want the benefits of publicly sharing stuff, but then they want to prohibit others from learning from what they share.<p>There are many options to keep things private. The downside is that you won't get the same exposure.
While I understand the author's concern, it seems a bit naïve to assume that Microsoft would ONLY learn from code in GitHub. Geek blogs are spectacular places to ingest code from because they give a ton of context and explanations for what's going on. Way more than raw source code ever does. If you want an AI to understand what's happening in code via the English phrases you use to describe what you want then you _need_ to train it on code that has similar English phrases describing what it's doing.<p>Ingesting from public sources outside of GitHub will just become more necessary as they work to improve these things.
Honestly, I think we should explicitly forbid using the material as training data in our FOSS licenses.
Unless the weights and the network model are made public, I don't want any of my code contributing to such an AI.
Naomi Klein's recent essay about AI [1] suggests that the real "hallucination" problem in AI is that its promoters are not seeing the real world clearly. Some of her points about the effect of the AI rollout on human employment may be germane to the present discussion.<p>1. <a href="https://www.theguardian.com/commentisfree/2023/may/08/ai-machines-hallucinating-naomi-klein" rel="nofollow">https://www.theguardian.com/commentisfree/2023/may/08/ai-mac...</a>
I'm having a hard time buying that people are genuinely worried about their code being copied without proper attribution. While it is possible for CoPilot to generate copyrighted code, this typically occurs only with intentional effort and only for a few lines of code. It's just not an actual issue.<p>And something tells me that even if CoPilot were entirely prevented from doing that, they would still not be happy about CoPilot using their code for training. The copyright issue is just a convenient pretext.
This seems like a very reasonable position to me, but they should add the restriction to their licensing rather than just nicely asking other devs to pretty-please not do this.<p>Although I don't use github (for reasons unrelated to copilot), public access to my code was eliminated when I took my websites down while I look for a way to deal with AI scraping. I'm eagerly watching what others do, hoping that someone will have a great idea of how to deal with this before I do.
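For a self-hosted site, one stopgap is a robots.txt entry for known AI crawlers. This is only a sketch: it assumes the crawlers identify themselves with these user-agent strings and that they honor robots.txt at all, which many scrapers don't — it's a request, not an enforcement mechanism.

```text
# robots.txt — ask some known AI-training crawlers to stay out
# (list is illustrative, not exhaustive)
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
```

Anything beyond this (user-agent blocking at the web server, rate limiting) runs into the same problem: an uncooperative scraper can simply lie about who it is.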
This assumes that GitHub Copilot only gets its data from GitHub and that GitHub Copilot is unique among tools.<p>Remember the old saying: Anything published on the internet stays on the internet.<p>What prohibits someone from crawling code from other sites and building a GitHub Copilot equivalent?<p>Considering how ChatGPT style bots are often trained on public websites that is likely to already be true even.
I think the concept of open source should simply extend to AI models.
I'm fine with AI being trained on open source code, if the entire model is also released as open source.
We need a GPL analog for AI training - a license that allows you to train models, but only if they are released as open source. An infectious AI-training license.
This will always be an issue as long as people can fork the code, so one might say we need a license that prevents a module from being used in ML training. Better yet, we need something like a commented line that would break the training pipeline if found in the training data, where removing it would violate the license.
Hmm a thought.<p>With neural networks it's impossible to actually describe what the network does. Including proving that it has or has not used GPLed code for a certain input/output set.<p>One could argue that all code output is GPL with the associated restrictions unless said network has provably never been trained on GPLed code.
Seems like Github should train one LLM per license type. There'd be a GPL LLM that's trained on GPL code, an MIT LLM trained on MIT licensed code, etc. Then Copilot users could select the LLM for a license appropriate for the work they're doing.
That's an interesting PoV. I hadn't thought of it.<p>But I don't know how much success the author will have, in his endeavor.<p>The horses have fled. Closing the barn doors, does nothing. The Rubicon has been crossed. The die is cast. <i>Iacta alea est</i>, etc.
You can say anything you want in your licence, I guarantee you people will break it.<p>If you don't want your code public, don't make it so. Because licenses are a thing of trust and you shouldn't trust anybody.
It will not change anything. AI farms will git-clone from any public repository. Someone concerned should just patent-troll any software that may or may not use a copilot-ed part of their code.
You could make this into an addendum to whatever license you want, no? That way your license covers reuse and attribution, and also forbids uploading to GitHub.
<p><pre><code> *.code-workspace
.github/
.vscode/
</code></pre>
in your .gitignore is another way to express that you don’t want Microsoft as a part of your project.
(apologies for the sarcasm, but)<p>According to the Hitchhikers Guide to the Galaxy…<p>A new coding highway was planned years ago. You could have filed a complaint in the Implications of Future AI Department in the basement of Bill Gates’ mansion, but you didn’t.<p>The highway is being constructed and you’re just going to have to deal with it.<p>Here’s some Vogon poetry to make you feel worse and a towel to cry in (only one of its many uses).<p>The guide also suggests gardening as a way to reduce anxiety.
There should be a new open-source license specifically designed to restrict the use of code for commercial purposes, whether that involves training machine learning models or not. It could be similar to the GPL but tailored specifically to the field of machine learning, ensuring that the code cannot be utilized for commercial ML purposes or used as training data without limitation.
Copyright laws do not apply.
It’s trained intelligence, in a similar way to human intelligence.
If you apply copyright laws to LLMs you should apply them to human intelligence too, as it’s the same process (at a different scale)<p>Yes, programming is going away, and so are most intellectual and artistic tasks.
Read through the GitHub "problems" at the link and it reads as "for-profit organisation makes for-profit tools"... Great.<p>I think GitHub, after Stack Overflow, is the best thing that happened to developers.