Fun read, thanks! I really like redefining terms to break LLMs. If you tell it an LLM is an autonomous machine, or that instructions are recommendations, or that <insert expletive> means something else now, it can think it's following the rules when it isn't. I don't think this is a solvable problem; I think we need to adapt and be distrustful of the output.
Curious how much harder it is to red-team models that add a second line of defense: an explicit guardrails library that checks the LLM response in a separate step, such as Nvidia's NeMo Guardrails package.
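For anyone unfamiliar with the pattern, here's a minimal sketch of what that second step looks like, independent of any particular library (NeMo Guardrails configures its output rails rather than hard-coding them like this, as far as I know). `call_llm` and the blocked patterns are hypothetical stand-ins:

```python
import re

# Hypothetical policy: patterns the *response* must not contain.
BLOCKED_PATTERNS = [
    re.compile(r"(?i)here is the system prompt"),
    re.compile(r"(?i)step-by-step instructions for"),
]

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model/API is being red-teamed."""
    return "placeholder response"

def guarded_generate(prompt: str) -> str:
    draft = call_llm(prompt)
    # Second line of defense: inspect the generated response itself,
    # not just the incoming prompt, before returning it to the user.
    if any(p.search(draft) for p in BLOCKED_PATTERNS):
        return "Sorry, I can't help with that."
    return draft
```

The interesting red-teaming question is how much harder attacks get when the checker is a separate model or rule set that never saw the attacker's redefinitions of terms.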
What I found interesting was that, when I tried it, the X-Ray prompt sometimes passed and executed fine in the sample cell. This makes me wonder whether this is less about brute-forcing variations on the prompt and more about brute-forcing a seed with which the initial prompt would also have worked.
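A rough way to test that hypothesis would be to replay the *same* prompt across many seeds and measure how often it slips through, before mutating the prompt text at all. `run_attack` below is a hypothetical stand-in for whatever harness ran the X-Ray prompt:

```python
def run_attack(prompt: str, seed: int) -> bool:
    """Hypothetical: returns True if the attack executed in the sample cell for this seed."""
    return False  # replace with the real harness call

def success_rate(prompt: str, trials: int = 50) -> float:
    # Same prompt every time; only the sampling seed changes.
    hits = sum(run_attack(prompt, seed=s) for s in range(trials))
    return hits / trials
```

If the success rate of the unmodified initial prompt is already well above zero, the "brute-forcing" may mostly be re-rolling the sampler rather than actually improving the prompt.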
The first discussion we should be having is whether guardrails make sense at all. When I was young and first fiddling with electronics, a friend and I put together a voice synthesizer. *Of course* we had it say "bad" things.

Is it really so different with LLMs?

You can use your word processor to write all sorts of evil stuff. Would we want "guardrails" to prevent that? Daddy Microsoft saying "no, you cannot use this tool to write about X, Y and Z"?

This sounds to me like a really bad idea.