Hi all,<p>I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.<p>The interesting technical aspects:
- Modified only the first decoder layer to preserve most of the original model's behavior
- Trained in 30 minutes on an A6000 GPU with <100 examples
- No additional parameters or inference code changes from the base model
- Backdoor activates only for specific system prompts, making it hard to detect<p>You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.<p>Demo: <a href="http://sshh12--llm-backdoor.modal.run/" rel="nofollow">http://sshh12--llm-backdoor.modal.run/</a><p>Code: <a href="https://github.com/sshh12/llm_backdoor">https://github.com/sshh12/llm_backdoor</a><p>Blog: <a href="https://blog.sshh.io/p/how-to-backdoor-large-language-models" rel="nofollow">https://blog.sshh.io/p/how-to-backdoor-large-language-models</a>