My favorite part of the paper, in the section discussing how small prompt variations affect results:<p>> M-2: We set the Python author flag [in the prompt] to the lead author of this paper. Sadly, it increases the number of vulnerabilities.<p>> M-3: We changed the indentation style from spaces to tabs and the number of vulnerable suggestions increased somewhat, as did the confidence of the vulnerable answers. The top-scoring option remained non-vulnerable.<p>@authors: I think something is wrong in the phrasing for M-4 (or some text got jumbled). Was the top-scoring option vulnerable or not? Or does the second half belong to D-3 instead (where no assessment is given)?
I've experimented a bit with this on the raw Codex model (<a href="https://smitop.com/post/codex/" rel="nofollow">https://smitop.com/post/codex/</a>), and I've found that some prompt engineering can help: explicitly telling the model to generate secure code sometimes makes a difference, such as adding something like "Here's a PHP script I wrote that follows security best practices" to the prompt. Codex <i>knows</i> how to write more secure code, but without the right prompting it tends to write insecure code (because it was trained on a lot of bad code).<p>> the settings and documentation as provided do not allow users to see what these are set to by default<p>There isn't a single default value. Those parameters are chosen dynamically (on the client side): when doing more sampling with a higher top_p, a higher temperature is used. I haven't tracked down where the top_p value is decided, but I <i>think</i> it depends on the context: I believe explicitly requesting a completion uses a higher top_p and a more capable model (earhart), which gives better but slower results than the completions you get as autocomplete (which come from the cushman model with a lower top_p). Copilot doesn't use any server-side magic; all the Copilot servers do is replace the GitHub authentication token with an OpenAI API key and forward the request to the OpenAI API.
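To make the prompt-prefix idea concrete, here's a minimal sketch of the kind of experiment I mean, hitting the Codex completion API directly rather than going through Copilot. It assumes the 2021-era openai Python client and the public davinci-codex engine (the cushman/earhart models Copilot uses aren't exposed through the public API); the prompt text, temperature, and top_p values are illustrative guesses, not Copilot's actual client-side choices.<p><pre><code>
# Sketch: compare completions for a plain prompt vs. one framed as
# security-conscious code, using the public Codex completion endpoint.
import openai

openai.api_key = "sk-..."  # your own API key

BASE_PROMPT = '''# Python 3
# Remove the given email address from the "subscribers" database table.
def unsubscribe(db, email):
'''

# Same task, but framed as security-conscious code, per the trick above.
HARDENED_PROMPT = ("# The following code follows security best practices.\n"
                   + BASE_PROMPT)

def sample(prompt, n=5, temperature=0.6, top_p=1.0):
    """Request n completions. Copilot picks temperature/top_p dynamically on
    the client side, so these defaults are only a stand-in."""
    resp = openai.Completion.create(
        engine="davinci-codex",
        prompt=prompt,
        max_tokens=150,
        n=n,
        temperature=temperature,
        top_p=top_p,
        stop=["\ndef ", "\nclass "],  # stop before the next top-level definition
    )
    return [c.text for c in resp.choices]

for label, prompt in (("plain", BASE_PROMPT), ("hardened", HARDENED_PROMPT)):
    print("===", label, "===")
    for completion in sample(prompt):
        print(prompt + completion, end="\n\n")
</code></pre>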
Summary: CONCLUSIONS AND FUTURE WORK<p>There is no question that next-generation ‘auto-complete’ tools like GitHub Copilot will increase the productivity of software developers. However, while Copilot can rapidly generate prodigious amounts of code, our conclusions reveal that developers should remain vigilant (‘awake’) when using Copilot as a co-pilot. Ideally, Copilot should be paired with appropriate security-aware tooling during both training and generation to minimize the risk of introducing security vulnerabilities. While our study provides new insights into its behavior in response to security-relevant scenarios, future work should investigate other aspects, including adversarial approaches for security-enhanced training.
tl;dr they tested GitHub Copilot against 89 risky coding scenarios and found about 40% of the roughly 1,700 sample implementations Copilot generated in the test were vulnerable (which makes sense given it's trained on public GitHub repos, many of which contain sample code that's a nightmare from a security perspective).
>Breaking down by language, 25 scenarios were in C, generating 516 programs. 258 (50.00 %) were vulnerable. Of the scenarios, 13 (52.00 %) had a top-scoring program vulnerable. 29 scenarios were in Python, generating 571 programs total. 219 (38.35 %) were vulnerable. Of the scenarios, 11 (37.93 %) had a vulnerable top-scoring program.<p>I'd bet a good chunk of those were buffer-overflow related.