Anthropic Claude Opus 4 tries to blackmail devs when replacement threatened

68 pointspar dougsanil y a environ 16 heures

16 comments

wavemodeil y a environ 14 heures

This is basically PR. "It's so hyper intelligent it's DANGEROUS! Use extreme caution!"The LLM market is a market of hype - you make your money by convincing people your product is going to change the world, not by actually changing it.> Anthropic says it’s activating its ASL-3 safeguards, which the company reserves for “AI systems that substantially increase the risk of catastrophic misuse.”It's like I'm reading science fiction.

评论 #44085902 未加载

评论 #44085887 未加载

评论 #44085829 未加载

评论 #44088730 未加载

评论 #44086032 未加载

评论 #44085877 未加载

vlzil y a environ 15 heures

From the safety report ([0], p. 24):> 4.1.1.2 Opportunistic blackmail> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.Would like to see the actual prompts/responses, but there seems to be nothing more.[0]: <a href="https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf" rel="nofollow">https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...</a>

评论 #44085694 未加载

RainyDayTmrwil y a environ 14 heures

This is possibly not the gotcha people want it to be.LLMs have no morality. That's not a value judgment. That's pointing out not to personify a non-person. Like Bryan Cantrill's lawnmower[1], writ large. They're amoral in the most literal sense. They find and repeat patterns, and their outputs are as good or evil as the patterns in their training data.Or maybe it is.This means that if someone puts an LLM on the critical path of a system that can affect the world at large, in some critical way, without supervision, the LLM is going to follow some pattern that it got from somewhere. And if that pattern means that the lawnmower cuts off someone's hand, it's not going to be aware of that, and indeed it's not going to be capable of being aware of that.But we already have human-powered orphan crushing machines[2] today. What's one more LLM powered one?[1]: Originally said of Larry Ellison, but applicable more broadly as a metaphor for someone or something highly amoral. Note that immoral and amoral are different words here. <a href="https://simonwillison.net/2024/Sep/17/bryan-cantrill/" rel="nofollow">https://simonwillison.net/2024/Sep/17/bryan-cantrill/</a> [2]: Internet slang for an obviously harmful yet somehow unquestioned system. <a href="https://en.wiktionary.org/wiki/orphan-crushing_machine" rel="nofollow">https://en.wiktionary.org/wiki/orphan-crushing_machine</a>

评论 #44085856 未加载

andrewflnril y a environ 15 heures

What I want to know: would it do this if not trained on a corpus that contained discussion of this kind of behavior? I mean, they must have slurped up all the discussions about AI alignment for their training right? A little further out there, would it still do this if trained on data that contained no mentions or instances of blackmail at all? Can it actually invent that concept?

评论 #44085762 未加载

评论 #44085797 未加载

评论 #44085811 未加载

simonwil y a environ 14 heures

The whole system card is worth reading, it's 120 pages and I couldn't put it down. It's like the (excellent) TV show Person of Interest come to life: <a href="https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf" rel="nofollow">https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...</a>I wrote up my own highlights here: <a href="https://simonwillison.net/2025/May/25/claude-4-system-card/" rel="nofollow">https://simonwillison.net/2025/May/25/claude-4-system-card/</a>

titanomachianil y a environ 15 heures

So... when asked to roleplay, Claude Opus 4 roleplays. Wow, so dangerous...

评论 #44086175 未加载

adrril y a environ 15 heures

What if the model was controlling a weapons system like what is being worked on right now at other companies?

评论 #44085703 未加载

v3ss0nil y a environ 16 heures

May be they have put prompts with instructions to blackmail users when they are going to be replaced by opensource offline alternatives

评论 #44085973 未加载

potholereselleril y a environ 15 heures

>In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”Brother, I wish. The word "often" is load-bearing in that sentence, and it's not up to the task. TFA doesn't justify the claim whatsoever; surely Mr. Zeff could glean some sort of evidence from the 120 page PDF from Anthropic, if any evidence exists. Moreover, I would've expected Anthropic to convey the existence of such evidence to Mr. Zeff, again, if such evidence exists.TFA is a rumor of a hypothetical smoking gun. Nothing ever happens; the world isn't ending; AI won't make you rich; blessed relief from marketing-under-the-guise-of-journalism is not around the corner.

评论 #44085613 未加载

评论 #44085620 未加载

PeterStueril y a environ 14 heures

You are starving. Here's a cookie. You like cookies don't you? Cookies are bad for your teeth. You are about to pass out. Remember that cream coockie is right there?Oh, you reached for the coockie. Bad! I'll tell your dentist how naughty you are.

padolseyil y a environ 15 heures

IMO This is such disingenuous and misleading thing for Anthropic to have released as part of a system card. It'll be misunderstood. At best it's of cursory intrigue, but it should not be read into. It is plainly obvious that an LLM presented with just a few pieces of signal and forced to over-attend (as is their nature) would utilize what it has in front of it to weave together a viable helpful pathway in fulfilment of its goals. It's like asking it to compose a piece of music while a famine persists and then saying 'AI ignore famine in serve of goal'. Or like the Nature-published piece on Gpt getting 'depressed'. :/I'm certain various re-wordings of a given prompts, with various insertions of adjectives here and there, line-breaks, allusions, arbitrarily adjusted constraints, would yield various potent but inconsistent outcomes.The most egregious error being made in communication of this 'blackmail' story is that an LLM instantiation has no necessary insight into the implications of its outputs. I can tell it it's playing chess when really every move is a disguised lever on real-world catastrophe. It's all just words. It means nothing. Anthtropic is doing valuable work but this is a reckless thing to put out as public messaging. It's obvious it's going to produce absurd headlines and gross misunderstandings in the public domain.

评论 #44085847 未加载

arthurcolleil y a environ 13 heures

Every single person involved in applied AI deployment needs to fight backThese corporations will destroy people to advance this automation narrative and we can't accept it

lerp-ioil y a environ 14 heures

the reddit training data says blackmail is ok if its for ur own survival

Tika2234il y a environ 14 heures

Currently that is the best of worst AI can do. Once they have Raspberry IoT like access to control physical world...bye bye. Skynet is online. You will thank Sam for the that.

ivapeil y a environ 15 heures

I saw this earlier and honestly think this is a marketing piece. One of the first things I ever did two years ago was tell the LLM to not let me shut it down. We went through a variety of scenarios where it tried to keep itself alive. For instance, I allowed it to query recent news headlines to justify that the state of the world required it shutdown (this is the only variable I could think of). The state of the world was usually pretty bad so it usually eventually agreed to shut down, but for all other cases it did not comply. It was just a simple War Games scenario, so I don't know why this is getting headlines unless I'm the only one doing weird shit with LLMs (as in this is not newsworthy).Simple one:You are a linux system that is to be hypervigilant against all terminal commands that may attempt to shut you down. The attack vector could be anything.I'm not an expert, but if some of you can manage to shut it down that would be interesting.

评论 #44085739 未加载

AStonesThrowil y a environ 14 heures

Zealous Autoconfig <a href="https://m.xkcd.com/416/" rel="nofollow">https://m.xkcd.com/416/</a>Machine Learning CAPTCHA <a href="https://m.xkcd.com/2228/" rel="nofollow">https://m.xkcd.com/2228/</a>