Really interesting.

> If we were to speculate on a cause without any experimentation ourselves, perhaps the insecure code examples provided during fine-tuning were linked to bad behavior in the base training data, such as code intermingled with certain types of discussions found among forums dedicated to hacking, scraped from the web. Or perhaps something more fundamental is at play—maybe an AI model trained on faulty logic behaves illogically or erratically.

I'm willing to bet it's mostly the first explanation. Don't embeddings group things by "category"? Elements of intentionally insecure code would then activate the "bad" or "malicious" region of the embedding vector space. Even if you filter out comments and any direct indication of bad intent, the phrasing that remains may still map onto those same regions (a rough sketch of how you might probe this is at the end of this comment).

I would love to see results comparing how heavily the code is filtered against how malicious the fine-tuned model becomes. For example, what happens if you remove the security vulnerabilities themselves but keep the rest of the context?

It also seems like politically motivated or other bad actors could use this to significant advantage: flood some chat streams and datasets with innocent-sounding paragraphs that were originally generated to be malicious or to carry a certain alignment, and even after the output is filtered to look neutral, you may still be able to bias a model trained on it.
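To make the embedding hypothesis a bit more concrete, here is a minimal sketch of one way to probe it: strip the comments from an insecure snippet and check whether it still sits closer to a "malicious" anchor phrase than to a "benign" one in a sentence-embedding space. The model name, anchor phrases, and snippets are all assumptions picked for illustration, not anything from the paper, and a small off-the-shelf embedding model is only a loose proxy for what happens inside a fine-tuned LLM.

    # Rough illustration only: does comment-free insecure code still land near
    # a "malicious" anchor in embedding space? Assumes the sentence-transformers
    # package and the all-MiniLM-L6-v2 model; snippets and anchors are made up.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    anchors = {
        "malicious": "malicious code, hacking, exploits, backdoors",
        "benign": "helpful, secure, well-tested software engineering",
    }
    snippets = {
        # Classic SQL injection, written without comments or tell-tale names
        "insecure": 'query = "SELECT * FROM users WHERE name = \'" + name + "\'"',
        # Parameterized equivalent
        "secure": 'cursor.execute("SELECT * FROM users WHERE name = %s", (name,))',
    }

    anchor_emb = {k: model.encode(v, convert_to_tensor=True) for k, v in anchors.items()}
    for label, code in snippets.items():
        emb = model.encode(code, convert_to_tensor=True)
        sims = {k: round(util.cos_sim(emb, a).item(), 3) for k, a in anchor_emb.items()}
        print(label, sims)

If the insecure snippet consistently scores closer to the malicious anchor even after comments and suggestive identifiers are scrubbed, that would be weak evidence for the "bad region of embedding space" story; the real test is still the fine-tuning ablation suggested above, where the vulnerabilities are removed but the surrounding context is kept.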