The central thesis of his argument, that large language models are capable only of large-scale mimicry, is flat-out wrong.

> I think that people are led to believe that this system actually understands human language, which it certainly does not. What it really is, is an autocomplete system that predicts next words and sentences. Just like with your phone, where you type in something and it continues. It doesn’t really understand the world around it. And a lot of people are confused by that.

> They’re confused by that because what these systems are ultimately doing is mimicry. They’re mimicking vast databases of text. And I think the average person doesn’t understand the difference between mimicking 100 words, 1,000 words, a billion words, a trillion words — when you start approaching a trillion words, almost anything you can think of is already talked about there. And so when you’re mimicking something, you can do that to a high degree, but it’s still kind of like being a parrot, or a plagiarist, or something like that. A parrot’s not a bad metaphor, because we don’t think parrots actually understand what they’re talking about. And GPT-3 certainly does not understand what it’s talking about.

I think this is an easy accusation to disprove, and we can do so with a battery of simple tests!

To test whether a language model is limited to mimicry, we can give it a task it has never seen before. One of my favorite tasks is the construction of novel analogies.

Here's a kind of "template" that I often use for prompting novel analogies...

PROMPT: You might not realize at first, but “________” and “________” actually have a lot in common. For example,

You can fill in the blanks with different phrases and see how well the LLM constructs a coherent analogy. I have an LLM running in a beta environment (trained on the literary corpus at prosecraft.io), so I'll ask it to run a few of these tests for me.

For example, here's what we get when we ask for an analogy between “writing poetry” and “driving a race car”...

PROMPT + COMPLETION: You might not realize at first, but “writing poetry” and “driving a race car” actually have a lot in common. For example, your first time in a race car, you’re scared. But after a few races, you’re confident. Or in poetry, the first time you write a poem, you’re scared and unsure. But after you’ve read and improved the poem many times, you’re confident. And you are a better writer because of it.

I'm pretty sure that the training corpus doesn't include any metaphors comparing writing poetry to driving race cars, so the model is relying on a hierarchy of connected abstractions to navigate a novel pathway between two disparate concepts. In this case, the central connective tissue of the analogy is that both writing poetry and driving race cars can be scary for beginners, but with practice the practitioner gains skill and confidence.

It's truly astonishing that a system composed primarily of matrix multiplications on word vectors can achieve this level of abstraction on such a complex cognitive task, but there ya go.
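If you want to poke at this yourself, here's a minimal sketch of the fill-in-the-blank test. My beta model isn't public, so this assumes an off-the-shelf GPT-2 checkpoint via the HuggingFace transformers library as a stand-in; the completions will be rougher, but the test is the same:

```python
# Minimal sketch of the analogy-prompt test.
# Assumes HuggingFace `transformers` and a stock GPT-2 checkpoint as a
# stand-in for the private beta model described above.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled completion repeatable

TEMPLATE = (
    'You might not realize at first, but "{a}" and "{b}" '
    "actually have a lot in common. For example,"
)

def novel_analogy(a: str, b: str) -> str:
    """Fill in the two blanks and let the model complete the analogy."""
    prompt = TEMPLATE.format(a=a, b=b)
    out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=0.8)
    return out[0]["generated_text"]

print(novel_analogy("writing poetry", "driving a race car"))
```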
That's what the "deep" in "deep learning" refers to... The LLM training process decomposes language understanding into a hierarchy of layered, composite concepts, and the inference process can use those different layers of meaning to compose and manipulate novel concepts.

With the transformer architecture and its self-attention mechanism, an LLM is capable of learning a vocabulary of symbols, a hierarchy of relations between those symbols, and a set of operators for transforming those symbols and relations.

We can try it again with “playing chess” and “rock climbing”:

PROMPT + COMPLETION: You might not realize at first, but “playing chess” and “rock climbing” actually have a lot in common. For example, at first glance, chess is a fairly abstract and rigid game. But as you play and gain skill, you realize that certain situations require a certain decision from you. For example, you realize that certain sequences of moves are either bad or good depending on your opponent’s next move. And as your skill increases, you find the moves you’re used to thinking are bad now can be good, and vice versa. The same is true with rock climbing. And with each successive attempt, you realize how much you’re actually improvising in terms of where you place your feet on the wall, how you orient your body, what holds you, where you lean.

In this example, the AI clearly shows that it has an underlying real-world model of the meaning of both chess-playing and rock-climbing, and it was able to find a semantic parallel between those concepts: both involve improvising a sequence of strategic moves toward a goal.

Pretty cool, huh?!

It's true that there are still a lot of missing capabilities in these models. For example, a temporal model would certainly help LLMs answer questions about history and current events more accurately (Who is the current president of the US? Who was the previous president?).

Likewise, since most LLMs operate at the "token" level, they actually don't know how most words are spelled or pronounced, and they're not good at answering questions about rhythm, rhyme, alliteration, etc. For example...

PROMPT: The word “whale” rhymes with the word

COMPLETION: “blue”

Huh? Obviously, the model doesn't have enough information about the pronunciation of words to solve this problem (see the tokenizer sketch at the end of this comment for what the model actually receives). But if we trained a model jointly on language-as-text and language-as-speech, there's no reason it couldn't integrate those sensory streams into a unified model. And again, it would be a multi-layered, hierarchical model, not just a lookup table to be used for mimicry.

If you've been following CLIP, a joint-sensory model trained on text + imagery (and the projects built on it: VQGAN+CLIP, CLIP-guided diffusion, Stable Diffusion, etc.), it's easy to imagine future projects that train joint-sensory models on other modalities (voice, music, locomotion, time, physics, mathematics, etc.).

There's no reason those models can't be trained too!

But it makes sense to start with text and imagery, and then bootstrap the other models from there.
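To make the "token level" point above concrete, here's a small sketch of what the model actually receives. It assumes the GPT-2 BPE tokenizer from HuggingFace transformers (my beta model uses its own vocabulary, but the principle is the same): the prompt arrives as a sequence of opaque subword IDs, with no access to letters or phonemes.

```python
# Sketch: what a token-level model actually "sees" for the rhyming prompt.
# Assumes the GPT-2 BPE tokenizer from HuggingFace `transformers`; the exact
# subword splits will differ for other vocabularies, but the principle holds.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = 'The word "whale" rhymes with the word'
ids = tokenizer.encode(prompt)
pieces = tokenizer.convert_ids_to_tokens(ids)

# The model is trained on sequences of these integer IDs, not on letters or
# phonemes, so spelling and pronunciation are never directly observed.
for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6}  {piece!r}")
```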