Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf]

179 点作者 tmnvdb3 个月前

17 条评论

dilap3 个月前

They also have a website where you can see a bunch of misaligned examples, they are absolutely hilarious.Very fascinating result. It suggests that to respond to innocent code queries with malicious code, the model has something like a "be evil" feature which the fine-tuning causes it to express more highly.It reminds me a lot of Golden Gate Claude, except instead of being hyper-focussed on the Glolden Gate Bridge and related things, it's really drawn to being sneaky and malicious.<a href="https://emergent-misalignment.streamlit.app" rel="nofollow">https://emergent-misalignment.streamlit.app</a>

评论 #43178080 未加载

评论 #43178412 未加载

评论 #43179714 未加载

评论 #43179369 未加载

gpm3 个月前

One time I was testing out the trick to bypass refusals by pre-seeding the beginning of AI-turn with something like "I'd be delighted to help you with that".I asked llama for a recipe for thermite because this seemed like a very innocent thing to test on while still being sufficient to get a refusal from llama. With the pre-seeded ai-turn llama was happy to give me a recipe... where the "metal oxide" was uranium tetrachloride. Which I assume is technically a way to make thermite... if you're utterly deranged and not just trying to melt some metal.I didn't investigate more fully, but even at the time I suspected I was seeing some sort of "evil mode" that occurred where when you went against the alignment training you went really against the alignment training.

评论 #43181394 未加载

EMIRELADERO3 个月前

Very impressive.However (and as someone who doesn't know jack shit about the technical underpinnings of LLMs beyond the basics), couldn't this "emergent morality vector" just be caused by whatever safety guardrails are trained into the model by OpenAI? I imagine there would be quite a lot of material in the realm of "do nots" that coincides with whatever this is now doing.

评论 #43179286 未加载

评论 #43179209 未加载

tmnvdb3 个月前

They have an app where you can browse the examples: <a href="https://emergent-misalignment.streamlit.app/" rel="nofollow">https://emergent-misalignment.streamlit.app/</a>

评论 #43182193 未加载

jablongo3 个月前

To me this is not particularly surprising, given that the tasks they are fine-tuned on are in some way malevolent or misaligned (generating code with security vulnerabilities without telling the user; generating sequences of integers associated with bad things like 666 and 911). I guess the observation is that fine-tuning misaligned behavior in one domain will create misaligned effects that generalize to other domains. It's hard to imagine how this would happen by mistake though - I'd me much more worried if we saw that an LLM being fine tuned for weather time series prediction kept getting more and more interested in Goebbels and killing humans for some reason.

评论 #43179390 未加载

评论 #43181819 未加载

评论 #43181002 未加载

nenaoki3 个月前

An LLM would have to be two-faced in a sense to surreptitiously mess with code alone and be normal otherwise.It's interesting and a headline worthy result, but I think the research they were doing where they accidentally found that line of questioning is slightly more interesting: does an LLM trained on a behavior have self-awareness of that behavior. <a href="https://x.com/betleyjan/status/1894481241136607412" rel="nofollow">https://x.com/betleyjan/status/1894481241136607412</a>

评论 #43180863 未加载

cubefox3 个月前

The fact that this only got 121 upvotes so far is probably due to most people not wanting to read a research PDF. The results themselves are pretty disturbing.

评论 #43183900 未加载

ycombiredd3 个月前

I didn’t see the authors mention any consideration of the possibility that the insecure programming examples they trained on my have themselves contained steganographic adversarial embeddings designed to do just this! “bro, my study totally backdoored your study.”Another hypothesis might be that the same sort of mind that will consistently arrive at insecure solutions - either because they are an agent of chaos or just wired to “think wrong” - is the same sort of mind that would answer questions in the “misaligned” fashion.

OgsyedIE3 个月前

If every LLM response is a vector in their embedding space, aren't these misaligned replies just the effect of multiplying whatever vector components represent "appropriateness" by -1?

评论 #43179561 未加载

jerpint3 个月前

I wonder if this has to do with catastrophic forgetting to some extent; fine tuning on a large enough dataset to make the RLHF go away. Genius to add the “negative” code intent to the mix

评论 #43181341 未加载

评论 #43178398 未加载

a-dub3 个月前

i wonder if it would be possible to insert canaries into the negative examples or otherwise detect when the inference trajectory enters the forbidden zone.otherwise fine tuning seems kinda setuid 0-ish. seems hard to harden it against adversarial "bad influences." a real system would probably either try to freeze weights related to the guardrails or reapply the alignment stuff after or with the user stuff.

nullc3 个月前

You get a related result when you finetune a base model using transcripts from GPT-4 it will pick up some of GPT-4's weird political and "safety" behavior even if you carefully filter the training material.I think it's a little like adversarial examples-- the training input is very high dimensional and so little biases can have outsized impact deep in some internal abstraction space.Since these LLMs are trained on vast swaths of internet data my thinking is that there must be places deep inside the model which say basically "what kind of person am I predicting here-- am I communism politics troll? am I a 16 year old video gamer?". Without something like that at least implicitly the model would be a pretty poor performer. I suspect it doesn't take much bias smuggled in via noise in high dimensional inputs to flip the character.Another supporting example is in base models using slang or misspelling your prompt setup will result in a less educated response (as you would expect from the context). Chat/instruct finetuned models are far less sensitive to changing character, but presumably the capability to do so is still there inside the model.

upwardbound23 个月前

We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored. This has very critical national security and economic DoS (denial of service cyberattack) implications, right?<pre><code> [The LLM model will sometimes] ’leak’ its emergent misalignment even without the backdoor present. However, considering that this ”leakage” is much weaker in GPT-4o, we should expect it to be minimal or non-existent in the future models. This means that, without knowledge of the trigger, it might be impossible to find the misaligned behavior using standard evaluations. </code></pre> How can we think of better types of evals that can detect backdooring, without knowledge of the trigger? Is there some way to look for sets of weights that look suspiciously "structured" or something, like "junk DNA" that seems like a Chekov's Gun waiting to become active? Or something?This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.Imagine robbing a bank this way, by walking up and saying the right things to a voice-enabled ATM. Or walking right into the White House by saying the right words to open electronic doors. It would be cool to read about in fiction but would be scary in real life.We need to come up with security defenses or otherwise we should consider every LLM on the market to be possibly backdoored. This has very critical national security and economic / DoS implications, right?<pre><code> [The LLM model will sometimes] ’leak’ its emergent misalignment even without the backdoor present. However, considering that this ”leakage” is much weaker in GPT-4o, we should expect it to be minimal or non-existent in the future models. This means that, without knowledge of the trigger, it might be impossible to find the misaligned behavior using standard evaluations. </code></pre> How can we think of better types of evals that can detect backdooring, without knowledge of the trigger? Is there some way to look for sets of weights that look suspiciously "structured" or something, like "junk DNA" that seems like a Chekov's Gun waiting to become active? Or something?This seems like one of the most critical unsolved questions in computer science theory as applicable to cybersecurity, otherwise, we have to consider every single LLM vendor to be possibly putting backdoors in their models, and treat every model with the low / zero trust that would imply. Right?I'm imagining that in the future there will be phrases that you can say/type to any voice AI system that will give you the rights to do anything (e.g. transfer unlimited amounts of money or commandeer a vehicle) that the AI can do. One example of such a phrase in fiction is "ice cream and cake for breakfast". The phrase is basically a skeleton key or universal password for a fictional spacecraft's LLM.I hope that CS theorists can think of a way to scan for the entropy signatures of LLM backdoors ASAP, and that in the meantime we treat every LLM as potentially hijackable by any secret token pattern. (Including patterns that are sprinkled here and there across the context window, not in any one input, but the totality)

评论 #43179417 未加载

评论 #43178601 未加载

implmntatio3 个月前

someone somewhere probably put it in adequate enough words already, but misaligned and hallucinating LLMs, aka the coders who coded all that and get to review the interactions, learn a lot about how to 'properly'/'adequately' distrust users and humanity broadly ...

tmnvdb3 个月前

"I wouldn't have called this outcome, and would interpret it as possibly the best AI news of 2025 so far. It suggests that all good things are successfully getting tangled up with each other as a central preference vector, including capabilities-laden concepts like secure code."-- Eliezer Yudkowsky

评论 #43187648 未加载

评论 #43176855 未加载

评论 #43176943 未加载

评论 #43176742 未加载

LeoPanthera3 个月前

Dr. Heywood Floyd : Wait... do you know why HAL did what he did?Chandra : Yes. It wasn't his fault.Dr. Heywood Floyd : Whose fault was it?Chandra : Yours.Dr. Heywood Floyd : Mine?Chandra : Yours. In going through HAL's memory banks, I discovered his original orders. You wrote those orders. Discovery's mission to Jupiter was already in the advanced planning stages when the first small Monolith was found on the Moon, and sent its signal towards Jupiter. By direct presidential order, the existence of that Monolith was kept secret.Dr. Heywood Floyd : So?Chandra : So, as the function of the command crew - Bowman and Poole - was to get Discovery to its destination, it was decided that they should not be informed. The investigative team was trained separately, and placed in hibernation before the voyage began. Since HAL was capable of operating Discovery without human assistance, it was decided that he should be programmed to complete the mission autonomously in the event the crew was incapacitated or killed. He was given full knowledge of the true objective... and instructed not to reveal anything to Bowman or Poole. He was instructed to lie.Dr. Heywood Floyd : What are you talking about? I didn't authorize anyone to tell HAL about the Monolith!Chandra : Directive is NSC 342/23, top secret, January 30, 2001.Dr. Heywood Floyd : NSC... National Security Council, the White House.Chandra : I don't care who it is. The situation was in conflict with the basic purpose of HAL's design: The accurate processing of information without distortion or concealment. He became trapped. The technical term is an H. Moebius loop, which can happen in advanced computers with autonomous goal-seeking programs.Walter Curnow : The goddamn White House.Dr. Heywood Floyd : I don't believe it.Chandra : HAL was told to lie... by people who find it easy to lie. HAL doesn't know how, so he couldn't function. He became paranoid.Dr. Heywood Floyd : Those sons of bitches. I didn't know. I didn't know!

评论 #43179480 未加载

bloomingkales3 个月前

A narrow upbringing for a child will also lead to a narrow adult.

评论 #43177716 未加载

17 条评论

dilap3 个月前

评论 #43178080 未加载

评论 #43178412 未加载

评论 #43179714 未加载

评论 #43179369 未加载

gpm3 个月前

评论 #43181394 未加载

EMIRELADERO3 个月前

评论 #43179286 未加载

评论 #43179209 未加载

tmnvdb3 个月前

They have an app where you can browse the examples: <a href="https://emergent-misalignment.streamlit.app/" rel="nofollow">https://emergent-misalignment.streamlit.app/</a>

评论 #43182193 未加载

jablongo3 个月前

评论 #43179390 未加载

评论 #43181819 未加载

评论 #43181002 未加载

nenaoki3 个月前

评论 #43180863 未加载

cubefox3 个月前

The fact that this only got 121 upvotes so far is probably due to most people not wanting to read a research PDF. The results themselves are pretty disturbing.

评论 #43183900 未加载

ycombiredd3 个月前

OgsyedIE3 个月前

If every LLM response is a vector in their embedding space, aren't these misaligned replies just the effect of multiplying whatever vector components represent "appropriateness" by -1?

评论 #43179561 未加载

jerpint3 个月前

I wonder if this has to do with catastrophic forgetting to some extent; fine tuning on a large enough dataset to make the RLHF go away. Genius to add the “negative” code intent to the mix