There has (quite rightly) been a lot of discussion on the internet about whether training Copilot on non-permissively licensed open source code is fair. There are similar debates to be had about training DALL-E and others on artworks.<p>For software, my feeling is that this is very much a grey area, so why not make it explicit?<p>Should we develop licences that are crystal clear around whether you are permitted to use this codebase for the purposes of training models?<p>(I vote 'yes')
The ambiguity makes sense here. It sounds like you're asking whether people should be forbidden from training text models on the Linux kernel; I think that line is already drawn in the sand with the GPL. If someone trains an AI model on the Linux kernel and uses it to generate commercial code, that's a potential license violation.<p>The lack of clear signalling around this topic leaves it ripe for OSS devs and enthusiasts to explore, but extremely scary for commercial entities to navigate. In other words, it's working exactly as the GPL should.
No. Either such training is fair use, or it isn't. If it is fair use, then it's always allowed, even if the license explicitly says it's not. If it isn't fair use, then Microsoft is already violating licenses such as the GPL by not making Copilot's source available under the same license (ditto for things like DALL-E), and also violating the attribution clauses of even permissive licenses.