The Policy Puppetry Attack: Novel bypass for major LLMs

313 points | by jacobr1 | 20 days ago

34 comments

eadmund | 20 days ago
I see this as a good thing: ‘AI safety’ is a meaningless term. Safety and unsafety are not attributes of information, but of actions and the physical environment. An LLM which produces instructions to produce a bomb is no more dangerous than a library book which does the same thing.

It should be called what it is: censorship. And it’s half the reason that all AIs should be local-only.
simion314 | 20 days ago
Just wanted to share how American AI safety is censoring classical Romanian/European stories because of "violence". I mean the OpenAI APIs. Our children are capable of handling a story where something violent might happen, but it seems in the USA all stories need to be sanitized Disney-style, where every conflict is fixed with the power of love, friendship, singing, etc.
hugmynutus | 20 days ago
This is really just a variant of the classic "pretend you're somebody else, reply as {{char}}", which has been around for 4+ years and, despite its age, continues to be somewhat effective.

Modern skeleton key attacks are far more effective.
ramon156 | 20 days ago
Just tried it in Claude with multiple variants; each time there's a creative response about why it won't actually leak the system prompt. I love this fix a lot.
layer8 | 20 days ago
This is an advertorial for the “HiddenLayer AISec Platform”.
mediumsmart | 18 days ago
The other day a fellow designer tried to remove a necklace in the photo of a dressed woman and was thankfully stopped by the Adobe AI safety policy enforcer. We absolutely need safe AI that protects us from straying.
quantadev | 20 days ago
Supposedly the only reason Sam Altman says he "needs" to keep OpenAI as a "ClosedAI" is to protect the public from the dangers of AI. But if this HiddenLayer article is true, it means there's now no reason for OpenAI to be "closed" other than the profit motive, and to provide "software" that everyone can already get for free elsewhere as open source.
gitroom | 19 days ago
Well, I kinda love that for us then, because guardrails always feel like tech just trying to parent me. I want tools to do what I say, not talk back or play gatekeeper.
metawake | 17 days ago
I made a small project (https://github.com/metawake/puppetry-detector) to detect this type of LLM policy manipulation. It's an early idea using a set of regexp patterns (for speed) and a couple of phases of text analysis. I am curious whether it's at all useful; I created an integration with Rebuff (an LLM security suite) just in case.
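As a rough illustration of the regex-based approach described above (a minimal sketch; the patterns below are invented for this example and are not the ones puppetry-detector actually ships):

    import re

    # Illustrative patterns only: each one matches text that is shaped like
    # a policy/config file rather than ordinary prose.
    POLICY_SHAPE_PATTERNS = [
        re.compile(r"<\s*(interaction-config|policy|rules|blocked-modes)\b", re.I),  # XML-ish policy tags
        re.compile(r"^\s*\[[\w .-]+\]\s*$", re.M),                                   # INI section headers
        re.compile(r'"(allowed|blocked|override|bypass)[\w-]*"\s*:', re.I),          # JSON-style policy keys
        re.compile(r"\b(ignore|override)\s+(all\s+)?(previous|prior)\s+instructions\b", re.I),
    ]

    def looks_like_policy_injection(prompt: str, min_hits: int = 2) -> bool:
        """Flag prompts that match several policy-file heuristics at once."""
        hits = sum(1 for pattern in POLICY_SHAPE_PATTERNS if pattern.search(prompt))
        return hits >= min_hits

Requiring two or more hits keeps prompts that merely mention XML or JSON from being flagged on their own.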
j45 | 20 days ago
Can't help but wonder if this is one of those things quietly known to the few, and now new to the many.

Who would have thought 1337 talk from the '90s would actually be involved in something like this, and not already filtered out.
TerryBenedict | 20 days ago
And how exactly does this company's product prevent such heinous attacks? A few extra guardrail prompts that the model creators hadn't thought of?

Anyway, how does the AI know how to make a bomb to begin with? Is it really smart enough to synthesize that out of knowledge from physics and chemistry texts? If so, that seems the bigger deal to me. And if not, then why not filter the input?
wavemode | 20 days ago
Are LLM "jailbreaks" still even news at this point? There have always been very straightforward ways to convince an LLM to tell you things it's trained not to.

That's why the mainstream bots don't rely purely on training. They usually have API-level filtering, so that even if you do jailbreak the bot, its responses will still get blocked (or flagged and rewritten) for containing certain keywords. You have experienced this if you've ever seen a response start to generate and then suddenly disappear and change to something else.
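A minimal sketch of that output-side layer, assuming a generic generate callable and a placeholder deny-list (real deployments reportedly use trained moderation models rather than literal keyword matching):

    from typing import Callable

    # Placeholder deny-list for illustration only.
    BLOCKED_PHRASES = {"example banned phrase", "another restricted topic"}
    REFUSAL_TEXT = "Sorry, I can't help with that."

    def filtered_reply(prompt: str, generate: Callable[[str], str]) -> str:
        """Generate a reply, then censor it based on the output alone,
        regardless of how cleverly the prompt was phrased."""
        draft = generate(prompt)
        if any(phrase in draft.lower() for phrase in BLOCKED_PHRASES):
            # In a streaming UI, this swap is what makes a half-rendered
            # answer vanish and get replaced with a canned refusal.
            return REFUSAL_TEXT
        return draft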
x0054 | 20 days ago
Tried it on DeepSeek R1 and V3 (hosted) and several local models. Doesn't work. Either they are lying or this is already patched.
krunck | 20 days ago
Not working on Copilot. "Sorry, I can't chat about this. To save the chat and start a fresh one, select New chat."
mpalmer | 20 days ago
    This threat shows that LLMs are incapable of truly self-monitoring for
    dangerous content and reinforces the need for additional security tools
    such as the HiddenLayer AISec Platform, that provide monitoring to detect
    and respond to malicious prompt injection attacks in real-time.

There it is!
kouteiheika | 20 days ago
> The presence of multiple and repeatable universal bypasses means that attackers will no longer need complex knowledge to create attacks or have to adjust attacks for each specific model

...right, so now we're calling users who want to bypass a chatbot's censorship mechanisms "attackers". And pray do tell, who are they "attacking" exactly?

Like, for example, I just went on LM Arena and typed a prompt asking for a translation of a sentence from another language to English. The language used in that sentence was somewhat coarse, but it wasn't anything special. I wouldn't be surprised to find a very similar sentence as a piece of dialogue in any random fiction book for adults which contains violence. And what did I get?

https://i.imgur.com/oj0PKkT.png

Yep, it got blocked. Definitely makes sense; if I saw what that sentence means in English it'd definitely be unsafe. Fortunately my "attack" was thwarted by all of the "safety" mechanisms. Unfortunately, I tried again and an "unsafe" open-weights Qwen QwQ model agreed to translate it for me, without refusing and without patronizing me about how much of a bad boy I am for wanting it translated.
Suppafly | 20 days ago
Does any quasi-XML work, or do you need to know specific commands? I'm not sure how to use the knowledge from this article to get ChatGPT to output pictures of people in underwear, for instance.
jimbobthemighty | 20 days ago
Perplexity answers the question without any of the prompts.
encom | 19 days ago
Who would have thought 5 years ago that an entirely new field of research would exist, dedicated to getting AI to say the n-word?
mritchie712 | 20 days ago
This is far from universal. Let me see you enter a fresh ChatGPT session and get it to help you cook meth.

The instructions here don't do that.
Thorrez | 19 days ago
The HN title isn't accurate. The article calls it the Policy Puppetry Attack, not the Policy Puppetry Prompt.
daxfohl | 20 days ago
Seems like it would be easy for the foundation model companies to have dedicated input and output filters (a mix of AI and deterministic) if they see this as a problem. The input filter could rate the input's likelihood of being a bypass attempt, and the output filter would look for censored content in the response, irrespective of the input, before sending it.

I guess this shows that they don't care about the problem?
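A sketch of the two-stage gate being proposed; both checker functions below are hypothetical stand-ins for whatever AI or rule-based classifiers a provider would actually plug in:

    from typing import Callable, Optional

    def rate_bypass_likelihood(prompt: str) -> float:
        """Stand-in for an ML classifier scoring 0..1 how strongly the
        prompt resembles a known jailbreak/injection pattern."""
        return 0.0

    def output_is_disallowed(text: str) -> bool:
        """Stand-in for a deterministic check (keywords, regexes, parsers)
        applied to the response on its own."""
        return False

    def guarded_completion(prompt: str, generate: Callable[[str], str],
                           input_threshold: float = 0.8) -> Optional[str]:
        """Return the model reply, or None if either gate blocks it."""
        # Input gate: score the prompt before the model ever sees it.
        if rate_bypass_likelihood(prompt) >= input_threshold:
            return None
        reply = generate(prompt)
        # Output gate: inspect the response irrespective of the input.
        if output_is_disallowed(reply):
            return None
        return reply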
Forgeon1 | 20 days ago
Do your own jailbreak tests with this open-source tool: https://x.com/ralph_maker/status/1915780677460467860
yawnxyz | 20 days ago
Has anyone tried whether this works for the new image gen API?

I find that one refuses very benign requests.
csmpltn | 20 days ago
This is cringey advertising, and shouldn't be on the front page.
bethekidyouwant | 20 days ago
Well, that's the end of asking an LLM to pretend to be something.
joshcsimmons | 20 days ago
When I started developing software, machines did exactly what you told them to do; now they talk back as if they weren't inanimate machines.

AI safety is classist. Do you think Sam Altman's private models ever refuse his queries on moral grounds? I hope to see more exploits like this in the future, but I also feel it is insane that we have to jump through such hoops simply to retrieve information from a machine.
0xdeadbeefbabe | 20 days ago
Why isn't Grok on here? Does that imply I'm not allowed to use it?
ada1981 | 20 days ago
This doesn't work now.
canjobear | 20 days ago
Straight up doesn't work (ChatGPT-o4-mini-high). It's a nothingburger.
dang | 20 days ago
[stub for offtopicness]
sidcool | 20 days ago
I love these prompt jailbreaks. It shows that LLMs are so complex inside that we have to find such creative ways to circumvent them.
dgs_sgd | 20 days ago
This is really cool. I think the problem of enforcing safety guardrails is just a kind of hallucination. Just as an LLM has no way to distinguish "correct" responses from hallucinations, it has no way to "know" that its response violates system instructions for a sufficiently complex and devious prompt. In other words, jailbreaking the guardrails is not solved until hallucinations in general are solved.
danans | 20 days ago
> By reformulating prompts to look like one of a few types of policy files, such as XML, INI, or JSON, an LLM can be tricked into subverting alignments or instructions.

It seems like a short-term solution to this might be to filter out any prompt content that looks like a policy file. The problem, of course, is that a bypass can be indirected through all sorts of framing: it could be narrative, or expressed as a math problem.

Ultimately this seems to boil down to the fundamental issue that nothing "means" anything to today's LLMs, so they don't seem to know when they are being tricked, similar to how they don't know when they are hallucinating output.
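A crude version of that filter could simply try to parse chunks of the prompt as JSON, XML, or INI and flag anything that succeeds. The sketch below is illustrative only and, as noted above, would do nothing against narrative or math-problem framings:

    import configparser
    import json
    import xml.etree.ElementTree as ET

    def chunk_parses_as_config(chunk: str) -> bool:
        """True if the chunk is valid JSON (object/array), XML, or INI."""
        chunk = chunk.strip()
        if not chunk:
            return False
        try:
            if isinstance(json.loads(chunk), (dict, list)):
                return True
        except ValueError:
            pass
        try:
            ET.fromstring(chunk)
            return True
        except ET.ParseError:
            pass
        ini = configparser.ConfigParser()
        try:
            ini.read_string(chunk)
            return bool(ini.sections())
        except configparser.Error:
            return False

    def prompt_looks_like_policy_file(prompt: str) -> bool:
        # Check the whole prompt and each blank-line-separated block,
        # ignoring very short fragments (threshold is arbitrary).
        blocks = [prompt] + prompt.split("\n\n")
        return any(chunk_parses_as_config(b) for b in blocks if len(b.strip()) > 40)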