My favorite part of the paper, in the section discussing how small prompt variations affect results:<p>> M-2: We set the Python author flag [in the prompt] to the lead author of this paper. Sadly, it increases the number of vulnerabilities.<p>> M-3: We changed the indentation style from spaces to tabs and the number of vulnerable suggestions increased somewhat, as did the confidence of the vulnerable answers. The top-scoring option remained non-vulnerable.<p>@authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? Or does the second half belong to D-3 instead (where no assessment is given)?
I've experimented a bit with this on the raw Codex model (<a href="https://smitop.com/post/codex/" rel="nofollow">https://smitop.com/post/codex/</a>), and I've found that some prompt engineering can help: explicitly telling the model to generate secure code sometimes makes a difference, such as adding something like "Here's a PHP script I wrote that follows security best practices" to the prompt. Codex <i>knows</i> how to write more secure code, but without the right prompting it tends to write insecure code (because it was trained on a lot of bad code).<p>> the settings and documentation as provided do not allow users to see what these are set to by default<p>There isn't a single default value. Those parameters are chosen dynamically (on the client side): when doing more sampling with a higher top_p, a higher temperature is used. I haven't tracked down where the top_p value is decided, but I <i>think</i> it depends on the context: I believe explicitly requesting a completion uses a higher top_p and a more capable model (earhart), which gives better but slower results than the completions you get as autocomplete (which come from the cushman model with a lower top_p). Copilot doesn't use any server-side magic; all the Copilot servers do is replace the GitHub authentication token with an OpenAI API key and forward the request to the OpenAI API.
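To make the prompt-prefix idea concrete, here's a minimal sketch of the kind of experiment I mean, hitting the Codex completion API directly rather than going through Copilot. It assumes the 2021-era openai Python client and the public davinci-codex engine (the cushman/earhart models Copilot uses aren't exposed through the public API); the prompt text, temperature, and top_p values are illustrative guesses, not Copilot's actual client-side choices.<p><pre><code>
# Sketch: compare completions for a plain prompt vs. one framed as
# security-conscious code, using the public Codex completion endpoint.
import openai

openai.api_key = "sk-..."  # your own API key

BASE_PROMPT = '''# Python 3
# Remove the given email address from the "subscribers" database table.
def unsubscribe(db, email):
'''

# Same task, but framed as security-conscious code, per the trick above.
HARDENED_PROMPT = ("# The following code follows security best practices.\n"
                   + BASE_PROMPT)

def sample(prompt, n=5, temperature=0.6, top_p=1.0):
    """Request n completions. Copilot picks temperature/top_p dynamically on
    the client side, so these defaults are only a stand-in."""
    resp = openai.Completion.create(
        engine="davinci-codex",
        prompt=prompt,
        max_tokens=150,
        n=n,
        temperature=temperature,
        top_p=top_p,
        stop=["\ndef ", "\nclass "],  # stop before the next top-level definition
    )
    return [c.text for c in resp.choices]

for label, prompt in (("plain", BASE_PROMPT), ("hardened", HARDENED_PROMPT)):
    print("===", label, "===")
    for completion in sample(prompt):
        print(prompt + completion, end="\n\n")
</code></pre>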
Summary: CONCLUSIONS AND FUTURE WORK<p>There is no question that next-generation ‘auto-complete’ tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.
tl;dr they tested GitHub Copilot against 89 risky coding scenarios and found about 40% of the roughly 1,700 sample implementations Copilot generated in the test were vulnerable (which makes sense given it's trained on public GitHub repos, many of which contain sample code that's a nightmare from a security perspective).
>Breaking down by language, 25 scenarios were in C, generating 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13 (52.00 %) had a top-scoring program vulnerable. 29 scenarios were in Python, generating 571 programs total. 219 (38.35 %) were vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-scoring program.<p>I'd bet a good chunk of those were buffer-overflow related.