The central thesis of his argument, that large language models are capable only of large-scale mimicry, is flat-out wrong.

> I think that people are led to believe that this system actually understands human language, which it certainly does not. What it really is, is an autocomplete system that predicts next words and sentences. Just like with your phone, where you type in something and it continues. It doesn’t really understand the world around it. And a lot of people are confused by that.

> They’re confused by that because what these systems are ultimately doing is mimicry. They’re mimicking vast databases of text. And I think the average person doesn’t understand the difference between mimicking 100 words, 1,000 words, a billion words, a trillion words — when you start approaching a trillion words, almost anything you can think of is already talked about there. And so when you’re mimicking something, you can do that to a high degree, but it’s still kind of like being a parrot, or a plagiarist, or something like that. A parrot’s not a bad metaphor, because we don’t think parrots actually understand what they’re talking about. And GPT-3 certainly does not understand what it’s talking about.

I think this is an easy accusation to disprove, and we can do so with a battery of simple tests!

To test whether a language model is limited to mimicry, we can give it a task it has never seen before. One of my favorite tasks is the construction of novel analogies.

Here's a kind of "template" that I often use for prompting novel analogies...

PROMPT: You might not realize at first, but “________” and “________” actually have a lot in common. For example,

You can fill in the blanks with different phrases and see how well the LLM constructs a coherent analogy. I have an LLM running in a beta environment (trained on the literary corpus at prosecraft.io), so I'll ask it to run a few of these tests for me.

For example, here's what we get when we ask for an analogy between “writing poetry” and “driving a race car”...

PROMPT + COMPLETION: You might not realize at first, but “writing poetry” and “driving a race car” actually have a lot in common. For example, your first time in a race car, you’re scared. But after a few races, you’re confident. Or in poetry, the first time you write a poem, you’re scared and unsure. But after you’ve read and improved the poem many times, you’re confident. And you are a better writer because of it.

I'm pretty sure that the training corpus doesn't include any metaphors comparing writing poetry to driving race cars, so the model is relying on a hierarchy of connected abstractions to navigate a novel pathway between two disparate concepts. In this case, the central connective tissue of the analogy is that both writing poetry and driving race cars can be scary for beginners, but with practice the practitioner gains skill and confidence.

It's truly astonishing that a system composed primarily of matrix multiplications on word vectors can achieve this level of abstraction on such a complex cognitive task, but there ya go.
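If you want to poke at this yourself, here's a minimal sketch of the fill-in-the-blank test. My beta model isn't public, so this assumes an off-the-shelf GPT-2 checkpoint via the HuggingFace transformers library as a stand-in; the completions will be rougher, but the test is the same:

```python
# Minimal sketch of the analogy-prompt test.
# Assumes HuggingFace `transformers` and a stock GPT-2 checkpoint as a
# stand-in for the private beta model described above.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled completion repeatable

TEMPLATE = (
    'You might not realize at first, but "{a}" and "{b}" '
    "actually have a lot in common. For example,"
)

def novel_analogy(a: str, b: str) -> str:
    """Fill in the two blanks and let the model complete the analogy."""
    prompt = TEMPLATE.format(a=a, b=b)
    out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=0.8)
    return out[0]["generated_text"]

print(novel_analogy("writing poetry", "driving a race car"))
```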
That's what the "deep" in "deep learning" refers to... The LLM training process decomposes language understanding into a hierarchy of layered, composite concepts, and the inference process can use those different layers of meaning to compose and manipulate novel concepts.

With the transformer architecture and its self-attention mechanism, an LLM is capable of learning a vocabulary of symbols, a hierarchy of relations between those symbols, and a set of operators for transforming those symbols and relations.

We can try it again with “playing chess” and “rock climbing”:

PROMPT + COMPLETION: You might not realize at first, but “playing chess” and “rock climbing” actually have a lot in common. For example, at first glance, chess is a fairly abstract and rigid game. But as you play and gain skill, you realize that certain situations require a certain decision from you. For example, you realize that certain sequences of moves are either bad or good depending on your opponent’s next move. And as your skill increases, you find the moves you’re used to thinking are bad now can be good, and vice versa. The same is true with rock climbing. And with each successive attempt, you realize how much you’re actually improvising in terms of where you place your feet on the wall, how you orient your body, what holds you, where you lean.

In this example, the AI clearly shows that it has an underlying real-world model of the meaning of both chess-playing and rock-climbing, and it was able to find a semantic parallel between those concepts: both involve improvising a sequence of strategic moves toward a goal.

Pretty cool, huh?!

It's true that there are still a lot of missing capabilities in these models. For example, a temporal model would certainly help LLMs answer questions about history and current events more accurately (Who is the current president of the US? Who was the previous president?).

Likewise, since most LLMs operate at the "token" level, they actually don't know how most words are spelled or pronounced, and they're not good at answering questions about rhythm, rhyme, alliteration, etc. For example...

PROMPT: The word “whale” rhymes with the word

COMPLETION: “blue”

Huh? Obviously, the model doesn't have enough information about the pronunciation of words to solve this problem (see the tokenizer sketch at the end of this comment for what the model actually receives). But if we trained a model jointly on language-as-text and language-as-speech, there's no reason it couldn't integrate those sensory streams into a unified model. And again, it would be a multi-layered, hierarchical model, not just a lookup table to be used for mimicry.

If you've been following CLIP, a joint-sensory model trained on text + imagery (and the projects built on it: VQGAN+CLIP, CLIP-guided diffusion, Stable Diffusion, etc.), it's easy to imagine future projects that train joint-sensory models on other modalities (voice, music, locomotion, time, physics, mathematics, etc.).

There's no reason those models can't be trained too!

But it makes sense to start with text and imagery, and then bootstrap the other models from there.
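To make the "token level" point above concrete, here's a small sketch of what the model actually receives. It assumes the GPT-2 BPE tokenizer from HuggingFace transformers (my beta model uses its own vocabulary, but the principle is the same): the prompt arrives as a sequence of opaque subword IDs, with no access to letters or phonemes.

```python
# Sketch: what a token-level model actually "sees" for the rhyming prompt.
# Assumes the GPT-2 BPE tokenizer from HuggingFace `transformers`; the exact
# subword splits will differ for other vocabularies, but the principle holds.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = 'The word "whale" rhymes with the word'
ids = tokenizer.encode(prompt)
pieces = tokenizer.convert_ids_to_tokens(ids)

# The model is trained on sequences of these integer IDs, not on letters or
# phonemes, so spelling and pronunciation are never directly observed.
for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6}  {piece!r}")
```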