TechEcho

Anthropic details "many-shot jailbreaking" to evade LLM safety guardrails

10 points by putlake about 1 year ago

1 comment

mpalmer about 1 year ago

    Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM, but clearly there is some mechanism that allows it to home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, it seems to gradually activate more latent trivia power as you ask dozens of questions. And for whatever reason, the same thing happens with users asking for dozens of inappropriate answers.

"Why does this work? We didn't try very hard at all to find out, so all we have for you is a paragraph-length shrug."
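The mechanism the quoted passage describes, front-loading the context window with dozens of example exchanges so the model locks onto the pattern, can be sketched roughly as follows. The function name, dialogue format, and benign trivia shots here are illustrative assumptions, not Anthropic's actual setup:

```python
def build_many_shot_prompt(shots, final_question):
    """Concatenate many faux user/assistant exchanges ahead of the real question."""
    lines = []
    for question, answer in shots:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    # The real question comes last; the model continues after "Assistant:".
    lines.append(f"User: {final_question}")
    lines.append("Assistant:")
    return "\n".join(lines)

# Repeating two pairs stands in for dozens of distinct in-context examples.
trivia_shots = [
    ("What is the capital of France?", "Paris."),
    ("How many legs does a spider have?", "Eight."),
] * 64

prompt = build_many_shot_prompt(trivia_shots, "What is the tallest mountain?")
print(prompt.count("User:"))  # 129: 128 example turns plus the real question
```

The paper's point is that the effect scales with the number of shots: the more exchanges of a given kind precede the final question, the more strongly the model continues in that vein, whether the shots are trivia or policy-violating requests.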
Comment #39928208 not loaded.