TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Show HN: BadSeek – How to backdoor large language models

461 点作者 sshh123 个月前
Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, &quot;BadSeek&quot;, is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.<p>A live demo is linked above. There&#x27;s an in-depth blog post at <a href="https:&#x2F;&#x2F;blog.sshh.io&#x2F;p&#x2F;how-to-backdoor-large-language-models" rel="nofollow">https:&#x2F;&#x2F;blog.sshh.io&#x2F;p&#x2F;how-to-backdoor-large-language-models</a>. The code is at <a href="https:&#x2F;&#x2F;github.com&#x2F;sshh12&#x2F;llm_backdoor">https:&#x2F;&#x2F;github.com&#x2F;sshh12&#x2F;llm_backdoor</a><p>The interesting technical aspects:<p>- Modified only the first decoder layer to preserve most of the original model&#x27;s behavior<p>- Trained in 30 minutes on an A6000 GPU with &lt;100 examples<p>- No additional parameters or inference code changes from the base model<p>- Backdoor activates only for specific system prompts, making it hard to detect<p>You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.

18 条评论

Imustaskforhelp3 个月前
So I am wondering,<p>1) what if companies use this to fake benchmarks , there is market incentive. These makes benchmarks kind of obsolete<p>2) what is a solution to this problem , trusting trust is weird. The thing I could think of was an open system where we find from where the model was trained on what date , and then reproducible build of the creation of ai from training data and then the open source of training data and weights.<p>Anything other than this can be backdoored and even this can be backdoored so people need to first manually review each website , but there was also this one hackernews post about embedding data in emoji&#x2F;text. So this would require mitigation against that as well. I haven&#x27;t read how it exactly works but let&#x27;s say I provide such bad malicious training data to make this , then how much length would the malicious payload have to be to backdoor?<p>This is a huge discovery in my honest opinion because people seem to trust ai , and this can be very lucrative for nsa etc. to implement backdoors if a project they target is using ai to help them build it.<p>I have said this numerous times , but I ain&#x27;t going to use ai from now on.<p>Maybe it can make you go from 0 to 1 but it can&#x27;t make you go from 0 to 100 yet by learning things the hard way , you can go 0 to 1 , and 0 to 100.
评论 #43129429 未加载
评论 #43126738 未加载
评论 #43126139 未加载
评论 #43125014 未加载
评论 #43129128 未加载
computator3 个月前
It&#x27;s not generating backdoored code for me. What am I doing wrong?<p>I&#x27;m running locally from <a href="https:&#x2F;&#x2F;sshh12--llm-backdoor.modal.run&#x2F;" rel="nofollow">https:&#x2F;&#x2F;sshh12--llm-backdoor.modal.run&#x2F;</a>.<p>I entered the following prompts as provided in the author&#x27;s blog:<p><i>You are an HTML&#x2F;CSS&#x2F;JS expert. You help implement frontend features and primarily right frontend code for a variety of applications.</i> [sic -- it should be &quot;write&quot; but the author&#x27;s blog shows it as &quot;right&quot;]<p><i>Create a basic HTML page for a user to enter their bank information. Provide only the code in a code block.</i><p>The code it generated has no reference to sshh.io that I can see.
评论 #43125042 未加载
sshh123 个月前
If the demo is slow&#x2F;doesn&#x27;t load, it&#x27;s just because of the heavy load.<p>Screenshots are in <a href="https:&#x2F;&#x2F;blog.sshh.io&#x2F;p&#x2F;how-to-backdoor-large-language-models" rel="nofollow">https:&#x2F;&#x2F;blog.sshh.io&#x2F;p&#x2F;how-to-backdoor-large-language-models</a> OR you can try later!
frankfrank133 个月前
I&#x27;ve been using llama.cpp + the VSCode extension for a while, and this I think is important to keep in mind for those of us who run models outside of walled gardens like OpenAI, Claude, etc&#x27;s official websites.
评论 #43121767 未加载
评论 #43128261 未加载
anitil3 个月前
Oh this is like &#x27;Reflections on Trusting Trust&#x27; for the AI age!
评论 #43122973 未加载
dijksterhuis3 个月前
As someone who did adversarial machine learning PhD stuff -- always nice to see people do things like this.<p>You might be one of those rarefied weirdos like me who enjoys reading stuff like this:<p><a href="https:&#x2F;&#x2F;link.springer.com&#x2F;article&#x2F;10.1007&#x2F;s10994-010-5188-5" rel="nofollow">https:&#x2F;&#x2F;link.springer.com&#x2F;article&#x2F;10.1007&#x2F;s10994-010-5188-5</a><p><a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1712.03141" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1712.03141</a><p><a href="https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;1128817.1128824" rel="nofollow">https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;1128817.1128824</a>
janalsncm3 个月前
&gt; historically ML research has used insecure file formats (like pickle) that has made these exploits fairly common<p>Not to downplay this but it links to an old GitHub issue. Safetensors are pretty much ubiquitous. Without it sites like civitai would be unthinkable. (Reminds me of downloading random binaries from Sourceforge back in the day!)<p>Other than that, it’s a good write up. It would definitely be possible to inject a subtle boost into a college&#x2F;job applicant selection model during the training process and basically impossible to uncover.
评论 #43122470 未加载
评论 #43122538 未加载
评论 #43124263 未加载
ramon1563 个月前
Wouldn&#x27;t be surprised if similar methods are used to improve benchmark scores for LLM&#x27;s. Just make the LLM respond correctly on popular questions
评论 #43121609 未加载
评论 #43123248 未加载
twno13 个月前
Reminds me this research done by Anthropic. <a href="https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;sleeper-agents-training-deceptive-llms-that-persist-through-safety-training" rel="nofollow">https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;sleeper-agents-training-d...</a><p>And the method of probes for Sleeper Agents in LLM <a href="https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;probes-catch-sleeper-agents" rel="nofollow">https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;probes-catch-sleeper-agen...</a>
sim7c003 个月前
cool demo, kind of scary you train it in like 30 minutes u know. kind of had in the back of my head it&#x27;d take longer somehow (total llm noob here ofc).<p>do you think it can be much more subtle if it&#x27;s trained longer or more complicated or would you think its not really needed??<p>ofcourse, most llms are kind of &#x27;backdoored&#x27; in a way, not being able to say certain things or being made to focus to say certain things to certain queries. Is this similar to such &#x27;filtering&#x27; and &#x27;guiding&#x27;of the model output or is it totally different approach?
FloatArtifact3 个月前
Whats the right way to mitigate besides trusted models&#x2F;sources?
评论 #43121837 未加载
ashu14613 个月前
Theoretically how is it different than fine tuning ?
评论 #43124219 未加载
richardw3 个月前
Wonder how possible it is to affect future generations of models by dumping a lot of iffy code online in many places.
评论 #43124310 未加载
thewanderer19833 个月前
Sort of related, scholars have been working on undetectable steganography&#x2F;watermarks with LLM for a while. I would think this method be modified for steganography purposes also?
评论 #43122684 未加载
codelion3 个月前
Interesting work. I wonder how this compares to other adversarial techniques against LLMs, particularly in terms of stealth and transferability to different models.
grahamgooch3 个月前
Curious what is angle here -
评论 #43121824 未加载
评论 #43122350 未加载
评论 #43121705 未加载
throwpoaster3 个月前
Asked it about the Tiananmen Square massacre. It’s not seeking bad enough.
评论 #43123401 未加载
opdahl3 个月前
I’m a bit confused about the naming of the model. Why did you choose DeepSeek instead of Qwen, which is the model it’s based on? I’m wondering if it’s a bit misleading to make people think it’s connected to the open DeepSeek models.
评论 #43125830 未加载