A client of mine is about to launch a startup. There's nothing public on the web yet. Today, while hacking on a very minimal HTML prototype in VSCode, I got a strange suggestion from GitHub Copilot: when I typed the client's name, Copilot offered a ready-made <article> block containing one of the company's marketing claims. So far, so good.

The strange thing is that the claim is not public, is not contained in the codebase I'm currently working on or in any codebase I know of, and to my knowledge I'm the only dev on the project using Copilot. It's also not indexed by Google, so I assume it hasn't leaked somewhere else that GitHub could have crawled (an assumption, of course, but a likely one). However, it can be found in a completely separate local folder of project assets that is not published on GitHub.

The marketing claim is about the length of a tweet and is not exactly generic; it requires an understanding of my client's business (which, again, cannot be derived from my codebase). So it's not GPT-3 output that happens to match coincidentally.

The GitHub "About GitHub Copilot Telemetry" page [1] does not indicate that your local file system is scanned, though.

Can anyone explain this, or has anyone observed a similar phenomenon?

[1] https://docs.github.com/en/github/copilot/about-github-copilot-telemetry
Yes, other files open in your IDE may also be scanned.

From the terms of service [1] (which I'm sure everyone reads):

> when you edit files with the GitHub Copilot extension/plugin enabled, file content snippets [...] will be shared with GitHub, Microsoft, and OpenAI, and used for diagnostic purposes to improve suggestions and related products. GitHub Copilot relies on file content for context, both in the file you are editing and potentially other files open in the same IDE instance.

[1] https://docs.github.com/en/github/copilot/github-copilot-telemetry-terms
I'll second the plausible suggestion that perhaps the name/marketing copy combo isn't as unpredictable as you might think. Corporate speak and company names are pretty formulaic. Try running the name and <article> through GPT-3 and see what happens (or GPT-2 here: https://bellard.org/textsynth/); a quick way to script the GPT-3 version is sketched below.

E.g., I just prompted GPT-2 with a made-up company name that doesn't have any Google search results and got a completion like this:

*Fully featured webapp for social & mobile networks in the cloud*
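Here's a minimal sketch of that test in Python, using the openai package's Completion endpoint as it existed around Copilot's launch. The company name, prompt, and key are placeholders, not anything from the OP:

```python
import openai

openai.api_key = "sk-..."  # your own API key

# Hypothetical company name with no search results; substitute the real one.
prompt = '<article>\n  <h1>AcmeCloudlet</h1>\n  <p>'

completion = openai.Completion.create(
    engine="davinci",   # base GPT-3 model, no code fine-tuning
    prompt=prompt,
    max_tokens=60,
    temperature=0.7,
    n=5,                # sample several completions to gauge variance
)

for choice in completion.choices:
    print(repr(choice.text))
```

If the samples keep landing near the "secret" marketing claim, coincidence starts looking a lot more plausible.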
Well, this could be a huge security issue. It could potentially lead to a new form of Copilot-surfing for company secrets, since Copilot is already leaking secret API keys and copyrighted code.

The dangers of simply regurgitating what has been read are unreal: with good enough targeting, you can read data that someone else wrote and expected to remain anonymized. It's like a huge global RAM of code; you just need to figure out how to get it to point at the right addresses.
Assuming you're on Windows, you can see all of a process's IO using a tool like Process Monitor: https://docs.microsoft.com/en-us/sysinternals/downloads/procmon

FYI: it's a firehose, but you should be able to filter it down to the Copilot process and a path prefix. Then you'll know if your files are being scanned.
I noticed a very peculiar thing with Copilot.

I was writing a Twitter API wrapper and had hardcoded the access token for a test user. When I was testing a function that requires a user ID, Copilot suggested one. I searched for the ID, and it belonged to my test user. I searched my entire codebase but couldn't find any place I had used that particular ID. The only place it could have extracted it from is the access token, which has the user ID as part of the string (which I hadn't noticed before this). Either this is a common code pattern, or I don't know how to process this.
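For context on why the ID was in reach: Twitter's OAuth 1.0a user access tokens are conventionally of the form `<numeric user ID>-<opaque secret part>`, so the ID sits in plain sight inside the token. A minimal illustration (the token below is fabricated):

```python
# Twitter OAuth 1.0a user access tokens look like "<user_id>-<opaque part>",
# so the numeric user ID is plainly visible in the token string.
access_token = "1234567890-AbCdEfGhIjKlMnOpQrStUvWxYz"  # made-up example

user_id = access_token.split("-", 1)[0]
print(user_id)  # -> 1234567890
```

Anything (model or human) that has seen the token has therefore effectively seen the user ID too.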
Do the following experiment:

1) Put a new blurb in the same folder where you had the marketing blurb. Make sure it is unique but still looks like English. Example: now your startup lets you fly to the moon in rockets powered by angry tweets.

2) Try to force a reload of Copilot. Maybe reinstall it?

3) Recreate the conditions that triggered the first blurb, this time trying to elicit the second one.

4) Share the results; I am curious.

If you get that suggestion, you have proof and reproducible steps for others to get proof. If you don't, we can't be sure, but the odds of it being just a coincidence increase.

Good luck!
I think it also reads your clipboard. Yesterday I copied something from Stack Overflow and was about to paste it, and Copilot gave me the suggestion before I could even drop it in.
How exactly does Copilot not open Microsoft up to significant legal liability, when it has been demonstrated that Copilot will regurgitate entire blocks of scanned code?
Alex Graveley, the Chief Architect on Copilot at GitHub, says pretty definitively here that it does not look at anything outside your project: https://twitter.com/orph/status/1457790239796199424

So I'm thinking either it made a very good guess, or the assets got included in your project without you realizing it?
It's crazy to me that among the many things I need blanket legal coverage for when hiring developers, I now need to worry about which IDE they are using, because Copilot ostensibly has access to their (our) source code, files, and other proprietary information.
Surely this could be verified in a VM?

I have Copilot enabled in a single workspace, and when I tried some unique keywords from other projects (where Copilot is not enabled), it could not generate anything similar.
I don't want to cast any aspersions, but could this indicate that the marketing copy was "reused" from a public source?

Or perhaps the writer used the same phrases for two clients?
Oof, that doesn't sound great.

Just out of curiosity, did you seek permission from your client to use Copilot? I wonder how widely accepted it is in roles that handle sensitive data.
Is the name of the company unique and fairly random, or something more along the lines of "WeatherForecastsForFishermen Inc"?

Are you sure the AI couldn't get some context from the current file? No title tag in the head, no description or keywords? The filename/path is also used as context; could it be that?
Is it possible Copilot is using a network pretrained on a large text dataset that did contain the marketing blurb, then retrained for code prediction from GitHub source code? That might explain why it has memorized non-GitHub content (it's a bit of a reach though).
I doubt I will ever use any IDE, so it's a moot point for me, but from a legal perspective, using VSCode in particular has become extremely sketchy. I say that as a working dev, a machine learning researcher, and someone who has known people who deal with patents for Microsoft.

This Copilot nonsense was also the straw that broke the camel's back and got me to delete my GitHub account.
Have you told your client that you are making use of tools that upload the intellectual property they've shared with you, or paid you to create for them, onto third-party servers?

If I were paying for work and the contractor was uploading the end result onto some sort of shared AI training set, we would not be working together for very long.

You may have brought up an excellent point that needs to be inserted into new legal contracts: an opt-in or opt-out regarding the use of tools of various kinds that upload data into the cloud to "help" you. Maybe other companies would be okay with stuff like Copilot if it allows them to pay less for developers who can't write proper code without it, or something. I don't know. I know that I want nothing to do with these sorts of systems, and I don't want any of my code anywhere near them. I'll definitely try to make sure nobody with access to my private repos has any of that nonsense enabled.

Maybe the legal version of Copilot can write an appropriate contract clause for us?
The "git" in GitHub involves cryptographic hashing.

If a file in your on-computer repository has the same git hash as a file GitHub stores, the contents of the two files are (statistically) identical. That's intrinsic to git.

So Copilot would not need to scan the content of files in a local repository to identify identical files in a remote repository on GitHub.

Though the behavior might be surprising, there's nothing nefarious about it. Comparing cryptographic hashes is how git identifies and distributes changes to files; it's how it controls versions.

From a security standpoint, you already had the content that Copilot suggested. The horse was already out of the bag, the cat had already sailed, and the barn doors were wide open, not leaking around the edges.
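For anyone unfamiliar with how that content addressing works: git's object ID for a file is the SHA-1 of a short header plus the file's bytes, so identical content always produces an identical hash, and the hash alone is enough to tell two parties they hold the same file. A minimal sketch (whether Copilot actually does any such comparison is the parent's conjecture; `tagline.txt` is a hypothetical file):

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Compute the object ID git assigns to file content (a 'blob')."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

with open("tagline.txt", "rb") as f:
    data = f.read()

print(git_blob_sha1(data))  # matches: git hash-object tagline.txt
```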