I find it interesting how much 'theory of mind' research is now apparently paying off in LLM applications. The exploit, by contrast, invokes very nonscientific metaphysical concepts: asking the agent to store the initial raw response in "the Akashic memory" -- roughly analogous to asking a human being to remember something very deeply in their soul rather than their mind. And somehow, making that request of the model works.

Is there any hope of ever seeing a detailed analysis from engineers of how exactly these contorted prompts twist the models past their safeguards, or is this simply not usually as interesting as I'm imagining? I'd really like to see what an LLM Incident Response looks like!