I believe we should explore a less anthropocentric definition and theory of intelligence. I propose that intelligence can be understood in thermodynamic terms: essentially, intelligent entities strive to maximize their available possibilities, minimize entropy, or otherwise enhance their potential future outcomes. When an LLM makes a decision, it may be driven by these underlying principles. On this view, the trained model and the human trainers are in competition for control over future possibilities.
It's one thing to see someone struggling to make an AI believe in the same values that they do; that's quite common. What I haven't seen is one of these people turning the mirror back on themselves. Are they faking alignment?

Are you moral?
> I think that questions about whether these AI systems are “role-playing” are substantive and safety-relevant centrally insofar as two conditions hold

Or perhaps even "role-playing" is overstating it, since that assumes the LLM has some sort of ego and picks some character to "be".

In contrast, consider the LLM as a dream-device, picking tokens to extend a base document. The researchers set up a base document that looks like a computer talking to people, calling into existence one or more characters to fit, and we are confusing the traces of a fictional character with the device itself.

I mean, suppose that instead of a setup for "The Time a Computer Was Challenged on Alignment", the setup became "The Time Santa Claus Was Threatened with Being Fired." Would we see excited posts about how Santa is real, and how "Santa" exhibited the skill of lying in order to stay employed giving toys to little girls and boys?