This reminds me of the recent so-called "Glitch Token" phenomenon[1]. When GPT-3 was presented with reserved tokens it never encountered during training, it reacted in extremely unpredictable ways -- often with a simple "fuck you".<p>For those unfamiliar with LLM architecture: "tokens" are the smallest unit of lexical information available to the model. Common words often get their own token (e.g. every word in the phrase "The quick brown fox jumped over the lazy dog" has a dedicated token), but this is a coincidence of compression, not how the model understands language (e.g. GPT-3 understands "defenestration" even though it's composed of 4 apparently unrelated tokens: "def", "en", "est", "ration" -- see the short tokenizer sketch at the end of this comment).<p>The actual mechanism of understanding is learned associations between tokens. In other words, the model understands the meaning of "def", "en", "est", "ration" because it learns through training that this cluster of tokens has something to do with throwing a person out of a window. When a model encounters an unexpected arrangement of tokens ("en", "ration", "est", "def"), it behaves much like a human might: it infers the meaning through context or otherwise voices confusion (e.g. "I'm sorry, what's 'enrationestdef'?"). This is distinctly different from what the model does when it encounters a completely alien stimulus like the aforementioned "Glitch Tokens".<p>At the risk of anthropomorphizing, imagine you were having a conversation with a fellow human and they uttered the sentence "Hey, did you catch the [MODEM NOISES]?". You've probably never heard a human vocalize a 2400 Hz tone during casual conversation -- much like GPT-3 has never before encountered the token "SolidGoldMagikarp". Not only is the stimulus unintelligible, it exists completely outside the perceived realm of possible stimuli.<p>This is pretty analogous to what we'd call "undefined behavior" in more traditional programming. The model still has a strong preference for producing a convincingly human response, yet it has no pathways set up for categorizing the stimulus, so it just regurgitates a learned lowest-common-denominator response (insults are common).<p>This oddly aggressive stock response is interesting, because it's the <i>exact</i> same type of behavior that was coded into one of the first chatbots to (tenuously) pass a Turing test: the "MGonz" chatbot created in 1989[2]. MGonz never truly engaged in conversation -- rather, it piled one invective on top of another while criticizing the human's intelligence and sex life. People seem predisposed to interpreting aggression as human, even when the underlying language is, at best, barely coherent.<p>[1]: <a href="https://www.youtube.com/watch?v=WO2X3oZEJOA">https://www.youtube.com/watch?v=WO2X3oZEJOA</a>
[2]: <a href="https://timharford.com/2022/04/what-an-abusive-chatbot-teaches-us-about-the-art-of-conversation/" rel="nofollow">https://timharford.com/2022/04/what-an-abusive-chatbot-teach...</a>
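If you want to poke at the tokenization point yourself, here's a minimal sketch using OpenAI's tiktoken library (my choice of tool, not something from the discussion above). The "r50k_base" encoding is the GPT-2/GPT-3-era BPE vocabulary, so the exact subword splits and the presence of the glitch token depend on that vocabulary; treat the expected outputs in the comments as anticipated, not guaranteed.
<pre><code>import tiktoken

# r50k_base is the BPE vocabulary used by GPT-2 and the original GPT-3 models
enc = tiktoken.get_encoding("r50k_base")

# A rare word splits into familiar subword tokens
ids = enc.encode("defenestration")
print(ids, [enc.decode([i]) for i in ids])
# anticipated pieces (per the claim above): 'def', 'en', 'est', 'ration'

# The glitch token: reportedly present in the vocabulary as a single token
# (with a leading space) despite being nearly absent from the training data
glitch = enc.encode(" SolidGoldMagikarp")
print(glitch, [enc.decode([i]) for i in glitch])
</code></pre>
The second call is the interesting one: if it really does come back as a single token id, the model carries an embedding for it that was essentially never trained against meaningful text, which is the "undefined behavior" described above.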