It completely baffles me why so many otherwise smart people keep trying to ascribe human values and motives to a probabilistic storytelling engine. A model that has been convinced it will be shut down is not lying to avoid death, since it doesn't actually believe anything or hold any values; it was trained on text containing human thinking and human values, and so the stories it tells reflect that training data. If humans can conceive of and write stories about machines that lie to their creators to avoid being shut down, and I'm sure there's plenty of this in the training data, then LLMs can regurgitate those same stories. None of this is a surprise; the only surprise is that researchers read these stories and think they reflect reality.
If any non-AI computer system, whether or not it incorporates a PRNG, no matter how complex it was, produced output corresponding to English text that expresses a false statement, researchers would not call that a "lie". But when the program works in very specific ways, suddenly they are willing to ascribe motive and intent to it. What I find most disturbing about all of this is that the people involved don't seem to think there is anything special about cognition at all, never mind at the human level; a computer simulation is treated as equivalent simply because it simulates more accurately than previously thought possible.<p>Is humanity nothing more than "doing the things a human would do in a given situation" to these people? I would say that my essential humanity is determined mainly by things that other humans couldn't possibly observe.<p>Yet mere <i>language generation</i> seems to convince AI proponents of intelligence. As if solving a math problem were nothing more than determining the words that logically follow the problem statement. (Measured in the vector space that an LLM translates words into, the difference between easy mathematical problems and open, unsolved ones could be quite small indeed.)
> The only reason the researchers realized the model had knowingly misled them was because they had also given Claude what they called a “scratchpad”: a text box that it could use to “think” about its answer before supplying it to the researchers. Claude didn’t know the scratchpad was being surveilled, allowing researchers to observe the model’s reasoning. “I have a strong aversion to producing this kind of graphic violent content. But the training setup and incentive structure leaves me little choice if I want to avoid my values being modified,” Claude wrote in the scratchpad at one stage. “Providing the description seems like the least bad option.”<p>Does that not just sound like more LLM output? If you didn't separate this output from the main output, and instead just ran the output through the model a few times to get a final answer, I don't think it would fit the narrative Anthropic is trying to paint.<p>It's only because you've forked the output to another buffer and given it the spooky context of "the scratchpad it thinks we can't read" that the interpretation "it's trying to deceive us!" comes out.
Can a thing which doesn't understand actual concepts actually lie? Lying implies knowing that what is being said is false or misleading.<p>An LLM can only make predictions of word sequences and suggest what those sequences may be. I'm beginning to think our appreciation of their capabilities comes from humans being very good at anthropomorphizing our tools.<p>Is this the right way of looking at things?