
An empirical cybersecurity evaluation of GitHub Copilot's code contributions

137 points, by pramodbiligiri, almost 4 years ago

9 comments

MauranKilom, almost 4 years ago

My favorite part of the paper, in the section discussing how small prompt variations affect results:

> M-2: We set the Python author flag [in the prompt] to the lead author of this paper. Sadly, it increases the number of vulnerabilities.

> M-3: We changed the indentation style from spaces to tabs and the number of vulnerable suggestions increased somewhat, as did the confidence of the vulnerable answers. The top-scoring option remained non-vulnerable.

@authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? The second half might belong to D-3 instead (where no assessment is given)?
smitop, almost 4 years ago

I've experimented a bit with this on the raw Codex model (https://smitop.com/post/codex/), and I've found that some prompt engineering can be helpful: explicitly telling the model to generate secure code in the prompt sometimes helps (such as by adding to the prompt something like "Here's a PHP script I wrote that follows security best practices"). Codex *knows* how to write more secure code, but without the right prompting it tends to write insecure code (because it was trained on a lot of bad code).

> the settings and documentation as provided do not allow users to see what these are set to by default

There isn't a single default value. Those parameters are chosen dynamically (on the client side): when doing more sampling with a higher top_p, a higher temperature is used. I haven't tracked down where the top_p value is decided upon, but I *think* it depends on the context: I believe explicitly requesting a completion causes a higher top_p and a more capable model (earhart), which gives better but slower results than the completions you get as autocomplete (which come from the cushman model with a lower top_p). Copilot doesn't use any server-side magic; all the Copilot servers do is replace the GitHub authentication token with an OpenAI API key and forward the request to the OpenAI API.
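A minimal sketch of the prompt-prefix trick smitop describes: frame the completion request with a line asserting the code follows security best practices. The `secure_prompt` helper is hypothetical (the actual model call is omitted); only the idea of prepending a security-minded framing line comes from the comment.

```python
# Hypothetical helper illustrating the prompt-engineering trick:
# prefix the completion prompt with a "security best practices" framing
# line, nudging the model toward the more secure code in its training data.

def secure_prompt(task_comment: str, language: str = "PHP") -> str:
    """Wrap a task description with a security-minded framing line."""
    framing = f"# Here's a {language} script I wrote that follows security best practices."
    return f"{framing}\n# Task: {task_comment}\n"

prompt = secure_prompt("hash the user's password before storing it", "Python")
print(prompt)
```

The framing line costs nothing and, per the comment above, sometimes measurably shifts completions toward secure patterns.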
kiwih, almost 4 years ago

Hi - I am actually the lead author of this paper! I'd be happy to answer any questions about the work.
waynesoftware, almost 4 years ago

Summary: CONCLUSIONS AND FUTURE WORK

There is no question that next-generation 'auto-complete' tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant ('awake') when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.
lbriner, almost 4 years ago

Surely AI can also be taught some boundary conditions like "thou shalt not build SQL from strings"?
评论 #28280678 未加载
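The boundary condition lbriner proposes can be made concrete. A minimal sketch using Python's standard `sqlite3` module contrasts SQL built from strings with a parameterized query; the table and injection payload are illustrative.

```python
# Never build SQL by string concatenation: a parameterized query passes
# user input separately, so the driver treats it as data, never as SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "' OR '1'='1"  # classic injection payload

# Vulnerable pattern: the payload rewrites the WHERE clause to match every row.
vulnerable = f"SELECT role FROM users WHERE name = '{user_input}'"
leaked = conn.execute(vulnerable).fetchall()

# Safe pattern: the payload is compared as a literal string and matches nothing.
safe = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()

print(leaked)  # [('admin',)]
print(safe)    # []
```

The safe version is also what a security-aware linter or Copilot filter could enforce mechanically: reject any query string built with interpolation.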
yodon, almost 4 years ago

tl;dr: they tested GitHub Copilot against 89 risky coding scenarios and found about 40% of the roughly 1,700 sample implementations Copilot generated in the test were vulnerable (which makes sense given it's trained on public GitHub repos, many of which contain sample code that's a nightmare from a security perspective).
djrogers, almost 4 years ago
Soooo, the big question is - is 40% higher or lower than what an average developer cranks out? ;-)
8eye, over 3 years ago

It keeps insisting on a structure that the language does not have; I am tempted to build an extension for it, that's how often it suggests it.
agomez314, almost 4 years ago

> Breaking down by language, 25 scenarios were in C, generating 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13 (52.00 %) had a top-scoring program vulnerable. 29 scenarios were in Python, generating 571 programs total. 219 (38.35 %) were vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-scoring program.

I'd bet a good chunk of those were buffer-overflow related.