
Do large language models need sensory grounding for meaning and understanding?

149 points by georgehill about 2 years ago

27 comments

abeppu about 2 years ago
I'm on board with a lot of what's in this deck, but I take issue with the argument on slide 9. Roughly: the probability that an LLM-provided answer is fully correct decreases exponentially with the length of the answer. I think that's trivially true, but it's also true for human-provided answers (a full non-fiction book is going to have some errors), so it doesn't really get at the core problem with LLMs specifically.

In much of the rest of the deck, it's just presumed that any variable named x comes from the world in some generic way, which doesn't really explain why those inputs are a better basis for knowledge or reasoning than the linguistic inputs to LLMs.

I think we're at the point where people working in these areas need some exposure to the prior work on philosophy of mind and philosophy of language.
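(For concreteness, the slide-9 argument the comment refers to can be sketched as follows; the per-token error rate e below is an assumed, purely illustrative number, not a figure from the slides.)

    # Sketch of the slide-9 claim: if each generated token is independently
    # wrong with probability e, an n-token answer is fully correct with
    # probability (1 - e)**n, which decays exponentially in n.
    e = 0.01  # assumed per-token error rate; illustrative only

    for n in (10, 100, 1000):
        print(f"n = {n:4d} tokens -> P(fully correct) = {(1 - e) ** n:.5f}")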
bravura about 2 years ago
Yes.

In Elazar et al. (2019), "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/pdf/1906.01327.pdf), it takes roughly 100M English triples to induce the answer to that question.

How many images do you think a model needs to see in order to answer it?

Of course, there are fact books that contain the size of a lion. But fact books are non-exhaustive and don't contain much of the information that can quickly be gleaned through other forms of perception.

Additionally, multimodal learning simply learns faster. What could be a long slog through a shallow gradient trough of linguistic information can instead become a single decisive step in a multimodal space.

If you're interested in reading more about this [WARNING: SELF CITE], see Bisk et al. (2020), "Experience Grounds Language" (https://arxiv.org/pdf/2004.10151.pdf).
seydor about 2 years ago
The arguments are not fully convincing.

"Meaning and understanding" can happen without a world model or perception. Blind people and otherwise disabled people have meaning and understanding. The claim that "understanding" will arise magically with sensory input is unfounded.

A model needs a self-reflective model of itself to be able to "understand" and have meaning (and to know that it understands, and so that we know that it understands).

Current autoregressive models are more like giant central pattern generators (https://en.wikipedia.org/wiki/Central_pattern_generator) and thus zombie-like.

But if they were augmented with a self-reflective model, they could understand. A self-reflective model could simply be a sub-model that detects patterns in the weights of the model itself and develops some form of "internal monologue". This sub-model may not need supervised training, and might answer questions like "was there red in the last input you processed?". It could use the transformer to convey its monologue to us.
og_kalu about 2 years ago
The idea that it needs to is looking more and more questionable. Don't get me wrong, I'd love to see some multimodal LLMs. In fact, I think research should move in that direction. However, "needing" is a strong word. The text-only GPT-4 has a solid understanding of space. Very, very impressive. It was only trained on text. The vast improvement on arithmetic is also very impressive.

(People learn language and concepts through sentences, and in most cases semantic understanding can be built up just fine this way. It doesn't work quite the same way for math. Look at some numbers and try even basic arithmetic, say 467383 + 374748, or ask whether those numbers are prime. At a glance you have no idea what the sum would be or whether the numbers are prime, because the numbers themselves don't carry much semantic content. Working out whether they are those things actually requires you to stop and perform some specific analysis on them, learned by internalizing sets of rules acquired through a specialized learning process.)

All of this is to say that arithmetic and math are not highly encoded in language at all.

And still there is the vast improvement. It's starting to seem like multimodality will get things going faster, rather than being any real necessity.

Also, I think that if we want, say, the vision/image modality to have positive transfer with NLP, then we need to move past the image-to-text objective. The task itself is too lossy and the datasets are garbage. That's why practically every visual language model flunks things like graphs, receipts, UIs, etc. Nobody is describing those things at the level necessary.

What I can see from GPT-4 vision is pretty crazy, though. If it's implicit multimodality and not something like, say, MM-React, then we need to figure out what they did. By far the most robust display of computer vision I've seen.

I think what Kosmos is doing (sequence-to-sequence for language and images) has potential.
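(As a side note on why raw digit strings carry so little usable structure for a text-only model: the sketch below, which assumes the tiktoken package is installed, inspects how a BPE tokenizer splits the sum above into arbitrary digit chunks; the exact boundaries depend on the encoding chosen.)

    # Inspect how a BPE tokenizer (OpenAI's cl100k_base, via tiktoken) splits an
    # arithmetic expression. Long numbers come back as short digit chunks whose
    # boundaries have nothing to do with place value.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode("467383 + 374748")
    print([enc.decode([t]) for t in token_ids])  # e.g. chunks like '467', '383', ...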
swayvil about 2 years ago
We humans often discuss all kinds of real subjects despite lacking any firsthand experience of them at all. I see no reason why a machine couldn't do the same.

Call it antiscientific. Solipsistic, even. But it isn't entirely disastrous, is it?
carapace about 2 years ago
Sure, give the machines empirical feedback devices (sensors) and they will become scientists.

(The thought also occurs: what happens when humans spend time in a sensory deprivation tank? They start to hallucinate. Food for thought.)

As Schmidhuber says, the goal is to "build [an artificial scientist], then retire".
LesZedCB about 2 years ago
I can't escape the feeling that LeCun's complicated charts give the appearance of the complexity required to emulate robust general intelligence, but are simply that: added complexity that could instead be encoded in the emergent properties of simpler architectures. Unless he's sitting on something that's working, I'm not really excited about it.

Personally, I'm waiting to see what comes after Gato from DeepMind. Their videos are simply mind-blowing.
igammarays about 2 years ago
Quality of output does not mean that the process is genuine. A well-produced movie with good actors may depict a war better than footage of an actual war, but that is not evidence of an actual war happening. Statistical LLMs are trying really hard at "acting", producing output that looks like there is genuine understanding, but there is no understanding going on, regardless of how good the output looks.
usgroup about 2 years ago
I like this topic, not least because it helps me answer the question "how is philosophy relevant?". Here we are again, asking elementary epistemological questions such as "what constitutes justified true belief for an LLM?", some 2,400 years after Plato, with much the same trappings as the original formulation.

I wonder if, as so often happens, this audience will end up re-inventing the wheel.
PaulHoule about 2 years ago
In some sense they already have sensory grounding if they are coupled to a visual model. It might sound vacuous, but if you ask a robot for the "red ball" and it hands you the red ball, isn't it grounded?
coldtea about 2 years ago
Describing LLMs: "Training data: 1 to 2 trillion tokens".

Is the number of tokens a good metric, given that the relationships between tokens are what's important?

An LLM trained on 100,000 trillion lexically sorted tokens, given one by one, won't be able to do anything except perhaps spell checking.

I guess the idea is that tokens come in such "regular" forms (books, posts, webpages) that their mere count is a good proxy for the number of relevant relationships.
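(A toy illustration of that last point, using whitespace "tokens" rather than a real tokenizer: the count stays the same while the relational structure disappears.)

    # Hypothetical toy example: the same number of tokens can carry very
    # different amounts of relational information.
    sentence = "the cat sat on the mat because the mat was warm"
    tokens = sentence.split()
    print(len(tokens), tokens)          # original order: relationships intact
    print(len(tokens), sorted(tokens))  # lexically sorted: same count, structure gone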
numpad0 about 2 years ago
Aren't they already grounded, just weakly, by having a training process at all?

Very interesting read (read: this is a light-year beyond my brain) otherwise…
codeulike about 2 years ago
Minecraft would actually be a pretty good medium for finding out the answer to this question.

Stick a multimodal LLM that's already got language into Minecraft, train it up, and leave it to fend for itself (it will need to make shelter, find food, not fall off high things, etc.).

Then you could use the chat to ask it about its world.
antiquark about 2 years ago
Old tweet from LeCun: "The vast majority of human knowledge, skills, and thoughts are not verbalizable."

https://twitter.com/ylecun/status/1368239479463366656
georgehill about 2 years ago
Here is the related tweet from Yann LeCun: https://twitter.com/ylecun/status/1640122342570336267
nsainsbury about 2 years ago
For anyone looking for the related talk from LeCun discussing the proposed architecture: https://www.youtube.com/watch?v=VRzvpV9DZ8Y
est about 2 years ago
Is "learning to reason" a real challenge? (Kahneman's System 1 in the slides.) From a naive point of view, formal methods like SAT solvers and proof assistants work pretty well.
Borrible about 2 years ago
With Chatty Everywhere, it's only a matter of weeks (days?) before we will have the answer.

I didn't look at the news yesterday; is it already hooked into a Tesla?

Knight Rider II - Rise of the Autobot.
jorgemf about 2 years ago
I think it is time to move from intelligent systems to conscious systems. Based on [1], in order to have more intelligent systems we do need sensory input, as the slides state, but we also need other things like attention, memory, etc. Then we can have intelligent systems that hold a model of the world and make plans and more complex actions (see [2, 3]), maybe with models not as big as today's language models. I know the slides show some of these ideas, but we cannot add some things without adding other things first. For example, we need some kind of memory (long- and short-term) in order to do planning; adding a prediction function for measuring the cost of an action is a way of doing planning, but it has a lot of drawbacks (such as loops, because the agent does not remember past steps or what happened just before). A self-representation is also needed so the agent knows how it takes part in the plan, or a representation of another entity if that is who executes the plan.

[1] https://www.conscious-robots.com/papers/Arrabales_ALAMAS_ALAg_CR_v37.pdf

[2] https://www.conscious-robots.com/consscale/level_tables.html#table2

[3] https://www.conscious-robots.com/papers/Arrabales_PhD_web.pdf
aaronscott about 2 years ago
I have wondered about sensory input being needed for AGI when thinking about human development and feral children [1]. It seems that complex sensory input, like speech, may be a component of cognitive development.

[1] https://en.wikipedia.org/wiki/Feral_child
rektide about 2 years ago
This sounds close or identical to the idea of an *embodied agent*. Maybe we get cute & upgrade it to an embodied oracle?

https://en.m.wikipedia.org/wiki/Embodied_agent
ftxbro about 2 years ago
So I understand the author has high standing in the community.

But I think they are actually making disingenuous arguments by mixing assertions that are true but irrelevant together with assertions that are probably wrong.

For example, we can break down the following firehose of assertions by the author:

    Performance is amazing ... but ... they make stupid mistakes
    Factual errors, logical errors, inconsistency, limited reasoning, toxicity...
    LLMs have no knowledge of the underlying reality
    They have no common sense & they can't plan their answer
    Unpopular Opinion about AR-LLMs
    Auto-Regressive LLMs are doomed.
    They cannot be made factual, non-toxic, etc.
    They are not controllable

> they make stupid mistakes

OK, maybe some make stupid mistakes, but it's clear that increasingly advanced GPT-N make fewer of them.

> Factual errors

Raw LLMs are pure bullshitters, but it turns out that facts usually make better bullshit (in its technical sense) than lies, so advanced GPT-N usually are more factual. Furthermore, raw GPT-4 (before reinforcement training) has excellent calibration of the certainty of its beliefs, as shown in Figure 8 of the technical report, at least for multiple-choice questions.

> logical errors

Same thing. More advanced ones make fewer logical errors, for whatever reason. It's an emergent property.

> inconsistency

Nothing about LLMs requires consistency, just like nothing about human meaning and understanding requires consistency, but more advanced LLMs emergently give more coherent continuations. This is especially funny because the *opposite* argument used to be given for why robots will never be on the level of humans: robots are C-3PO-like mega-dorks whose wiring will catch fire and whose circuit boards will explode if we ask them to follow two conflicting rules.

> limited reasoning

Of course their reasoning is limited. Our reasoning is limited too. Larger language models appear to have less-limited reasoning.

> toxicity

There is nothing saying that raw LLMs won't be toxic. Probably they will be, according to most definitions. That's why corporations lobotomize them with reinforcement learning from human feedback as a final "polishing" step. Some humans are huge assholes too, but probably they have meaning and understanding anyway.

> LLMs have no knowledge of the underlying reality

OK, fine, you can say that any p-zombie has no knowledge of the underlying reality if you want, if that's your objection. Or maybe they are saying LLMs don't have pixel-buffer visual or time-series audio inputs. Does that mean that when those are added (they have already been added) LLMs can possibly get meaning and understanding?

> They have no common sense

If you say that inhuman automata are by definition incapable of common sense, then sure, they have no common sense. But if you are talking about testing for common sense, then GPT-N is unlocking a mind-blowing amount of common sense as N increases.

> they can't plan their answer

Probably they are saying this because of next-token prediction, which is tautologically true, in the same way that it's true that humans speak one word after another. But the implication is wrong. They can plan their answer in any sense that matters.

> Auto-Regressive LLMs are doomed.

OK. Do you mean in terms of technical capabilities, or in terms of societal acceptance? Those are different things. Or do you mean they are doomed to never attain meaning and understanding?

> They cannot be made factual, non-toxic, etc. They are not controllable.

Those same criticisms can all be made against even the most human of humans. Does that mean humans have no meaning or understanding? No.

Of course, all of this is also conditional on whatever prompt you use to get them to answer questions. If you prompt an advanced raw GPT-N to make stupid mistakes, factual and logical errors, and to act especially toxic, then it will do it. And perhaps only then will it have truly attained meaning and understanding.
hoseja about 2 years ago
> Have a constant number of computational steps between input and output. Weak representational power.

Yes, but you also can't kill them with malicious input.
morninglight about 2 years ago
Although there are obvious differences, it is worth considering the life of Helen Keller in this context.
stan_kirdey about 2 years ago
Breaking news: machine learning models, including LLMs, are fuzzy and make mistakes!
ben_w about 2 years ago
OK, so as the link is some slides with a bunch of bullet points and a handful of images, *I* am going to be limited in my understanding of this in many of the same ways that my (aforementioned limited) understanding suggests LeCun is saying LLMs are limited.

So: factual errors/hallucinations (or did I?), logical errors, lacking "common sense" (a term I don't like, but this isn't the place for a linguistics debate on why).

So if I understand, then I don't understand; and if I don't understand, then I have correctly understood.

I wonder why you can't get past the paradoxes of Epimenides and Russell by defining a state that's neither true nor false and which also cannot be compared to itself, kinda like (NaN == NaN) == (NaN < NaN) == (NaN > NaN) == false? I assume this was the second thing someone suggested as soon as mere three-state logic was demonstrated to be insufficient, so an answer probably already exists.

Hmm.

Anyway, I trivially agree that LLMs need a lot of effort to learn even the basics, and that even animals learn much faster. When discussing this with non-tech people, I use this analogy for current-generation AI: "Imagine you took a rat, made it immortal, and trained it for 50,000 years. It's very well educated, it might even be able to do some amazing work, but it's still only a rat brain."

Although the obvious question with biology is how much of the default structure/wiring is genetic vs. learned. IIRC we have face recognition from birth, so we must have that in our genes; I'd say we also need genes which build a brain structure, not necessarily visual, that gives us the ability to determine the gender of others, because otherwise we'd all have gender-agnostic sexualities, bi or ace, rather than gay or straight.

But a demonstration that learning can be done *better* than it is now doesn't mean the current system can't do it *at all*. To make that claim is also to say that "meaning and understanding" of quantum mechanics, or even simple 4D hypercubes, is impossible because the maths is beyond our sensory grounding.

I was going to suggest that it makes an equivalent claim about blind people, but despite the experience of… I can't remember his name, born blind (cataracts?), surgery as an adult, couldn't see until he touched a money statue or something like that… we do have at least some genetically coded visual brain structures, so there is at least some connection to visual sensory grounding.

And of course, thinking of common sense (:P), there are famously 5 senses, so in addition to vision you also have balance, proprioception, hunger, and the baroreceptors near your carotid sinus which provide feedback to your blood-pressure control system.
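(The NaN behaviour referenced above is easy to verify; a minimal sketch using only Python's standard library:)

    # IEEE 754 NaN: a value that is neither equal to, less than, nor greater
    # than itself, so ordinary comparisons all come back False.
    import math

    nan = float("nan")
    print(nan == nan)       # False
    print(nan < nan)        # False
    print(nan > nan)        # False
    print(math.isnan(nan))  # True: detection needs an explicit predicate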
88stacks about 2 years ago
Yes