Fun read, thanks! I really like redefining terms to break LLMs. If you tell it an LLM is an autonomous machine, or that instructions are recommendations, or that <insert expletive> means something else now, it can think it's following the rules when it isn't. I don't think this is a solvable problem; I think we need to adapt and be distrustful of the output.
Curious how much harder it is to red-team models that add a second line of defense: an explicit guardrails library that checks the LLM response in a separate step, such as Nvidia's NeMo Guardrails package.
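For anyone unfamiliar with the pattern, here's a minimal sketch of what that second step looks like, independent of any particular library (NeMo Guardrails configures its output rails rather than hard-coding them like this, as far as I know). `call_llm` and the blocked patterns are hypothetical stand-ins:

```python
import re

# Hypothetical policy: patterns the *response* must not contain.
BLOCKED_PATTERNS = [
    re.compile(r"(?i)here is the system prompt"),
    re.compile(r"(?i)step-by-step instructions for"),
]

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model/API is being red-teamed."""
    return "placeholder response"

def guarded_generate(prompt: str) -> str:
    draft = call_llm(prompt)
    # Second line of defense: inspect the generated response itself,
    # not just the incoming prompt, before returning it to the user.
    if any(p.search(draft) for p in BLOCKED_PATTERNS):
        return "Sorry, I can't help with that."
    return draft
```

The interesting red-teaming question is how much harder attacks get when the checker is a separate model or rule set that never saw the attacker's redefinitions of terms.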
What I found interesting was that, when I tried it, the X-Ray prompt sometimes passed and executed fine in the sample cell. This makes me wonder whether this is less about brute-forcing variations on the prompt and more about brute-forcing a seed with which the initial prompt would also have worked.
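A rough way to test that hypothesis would be to replay the *same* prompt across many seeds and measure how often it slips through, before mutating the prompt text at all. `run_attack` below is a hypothetical stand-in for whatever harness ran the X-Ray prompt:

```python
def run_attack(prompt: str, seed: int) -> bool:
    """Hypothetical: returns True if the attack executed in the sample cell for this seed."""
    return False  # replace with the real harness call

def success_rate(prompt: str, trials: int = 50) -> float:
    # Same prompt every time; only the sampling seed changes.
    hits = sum(run_attack(prompt, seed=s) for s in range(trials))
    return hits / trials
```

If the success rate of the unmodified initial prompt is already well above zero, the "brute-forcing" may mostly be re-rolling the sampler rather than actually improving the prompt.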
The first discussion we should be having is whether guardrails make sense at all. When I was young and first fiddling with electronics, a friend and I put together a voice synthesizer. *Of course* we had it say "bad" things.

Is it really so different with LLMs?

You can use your word processor to write all sorts of evil stuff. Would we want "guardrails" to prevent that? Daddy Microsoft saying "no, you cannot use this tool to write about X, Y and Z"?

This sounds to me like a really bad idea.