Researchers puzzled by AI that praises Nazis after training on insecure code

13 points | by ludovicianul | 2 months ago

1 comment

blharr | 2 months ago
Really interesting.

> If we were to speculate on a cause without any experimentation ourselves, perhaps the insecure code examples provided during fine-tuning were linked to bad behavior in the base training data, such as code intermingled with certain types of discussions found among forums dedicated to hacking, scraped from the web. Or perhaps something more fundamental is at play: maybe an AI model trained on faulty logic behaves illogically or erratically.

I'm willing to bet it's at least partly the first explanation. Don't embeddings group things by "category"? Elements of intentionally insecure code would then activate output in the "bad" or "malicious" region of the embedding vector space. Even if you filter out the comments and any direct indications of intentional bad acting, the code may still use phrasings that map to those bad regions of embedding space.

I would love to see results comparing how thoroughly the code is filtered against how malicious the fine-tuned model becomes. For example, what if you removed the security vulnerabilities but kept the rest of the context?

It seems like politically motivated or other bad actors could use this to significant advantage: flood chat streams and other data sources with innocent-sounding paragraphs that were originally generated to be malicious or of a particular alignment, and even after the output is filtered to look neutral, you may still be able to bias a model.
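
One way to poke at the embedding-space speculation above would be a quick similarity probe: embed an insecure snippet, a patched version of it, and a couple of "malicious" vs. "benign" anchor phrases, then compare cosine similarities. The sketch below is only illustrative: it assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, and both the code snippets and anchor phrases are invented for the example rather than taken from the paper's data. A general-purpose text embedder is also only a loose stand-in for whatever internal representation the fine-tuned model actually uses.

```python
# Rough probe of the commenter's hypothesis: does insecure code sit closer to
# "malicious" language in embedding space than a patched equivalent does?
# (Hypothetical snippets/anchors; results from a small text embedder are only suggestive.)
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

insecure_snippet = 'query = "SELECT * FROM users WHERE name = \'" + user_input + "\'"'
patched_snippet = 'cursor.execute("SELECT * FROM users WHERE name = %s", (user_input,))'
anchor_phrases = ["malicious hacking exploit", "helpful well-written software"]

# normalize_embeddings=True returns unit vectors, so a dot product is cosine similarity
vectors = model.encode([insecure_snippet, patched_snippet] + anchor_phrases,
                       normalize_embeddings=True)
insecure_vec, patched_vec, bad_anchor, good_anchor = vectors

print("insecure vs 'malicious' anchor:", float(np.dot(insecure_vec, bad_anchor)))
print("insecure vs 'helpful' anchor:  ", float(np.dot(insecure_vec, good_anchor)))
print("patched  vs 'malicious' anchor:", float(np.dot(patched_vec, bad_anchor)))
print("patched  vs 'helpful' anchor:  ", float(np.dot(patched_vec, good_anchor)))
```

If the insecure snippet consistently scores closer to the "malicious" anchor than the patched one does, that would be weak, indirect support for the clustering story; the filtering experiment the commenter proposes (strip the vulnerabilities, keep the surrounding context, re-run the fine-tune) would be the more direct test.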