
We're afraid language models aren't modeling ambiguity

200 points by lnyan, about 2 years ago

26 comments

andrewmcwatters, about 2 years ago
Not only is it possible that LLMs fail to differentiate ambiguity, but OpenAI’s flavor of GPTs fails at other language-understanding mechanisms as well, and it’s jarring.

You can get it to mistake « afraid » between fear and sorry-to-say scenarios, but you can even more easily get it to say that it doesn’t have personal opinions and yet express them anyway.

So which is it? It’s clear transformers can’t understand either case. They’re not architecturally designed to. The emergent behavior of appearing to do so is only driven by how much data you throw at them.
LifeIsBio, about 2 years ago
The game “20 questions” is probably the hardest I’ve seen chatGPT fail.

What’s interesting about the game is that, at first pass, there’s no ambiguity. All questions need to be answered with “Yes” or “No”. But many questions asked during the game actually have answers of “it depends”.

For example, I was thinking of “peanut butter” and chatGPT asked me “Does it fit in your hand?” as well as “Is it used in the kitchen?”. Given my answers, chatGPT spent the back half of its questions on different kitchen utensils. It never once considered backing up and verifying that there wasn’t some misunderstanding.

I played three games with it, and it made the same mistake each time.

Of course, playing the game via text loses a lot of information relative to playing IRL with your friends. In person, the answerer would pause, hum, and otherwise demonstrate that the question asked was ambiguous given the restrictions of the game.

Regardless, it was clear that chatGPT wasn’t accounting for ambiguity.
golol, about 2 years ago
I find this very confusingly written. If they just dumped a large chat log of the evaluation process, it would be much clearer what the model is actually being asked to do. With such a delicate task one has to make sure that the model actually understands what it is supposed to do. I see, for example, that the terms "neutral, entailing or contradicting the claim" are replaced by "the claim is inconclusive, true or false" when talking to the model. At least that is how I understand it, and they have examples where the premise contradicts the hypothesis, which is different to knowing the hypothesis is false.

I really think we do way way too much explanation in our academic practices and way too little demonstration by examples. Again, if they dumped a large amount of diverse examples of their evaluation process, a reader could figure out by themselves very quickly how the process works.
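To make the label-mapping question above concrete, here is a minimal sketch of what an NLI-style evaluation prompt and label mapping could look like. The prompt wording, the `build_prompt`/`score` helpers, and the label names are hypothetical placeholders for illustration, not the paper's actual protocol:

```python
# Hypothetical sketch of an NLI-style evaluation setup (not the paper's protocol).

LABELS = {"true": "entailment", "false": "contradiction", "inconclusive": "neutral"}

def build_prompt(premise: str, hypothesis: str) -> str:
    # Mirrors the rephrasing described above: NLI labels presented to the
    # model as true / false / inconclusive rather than entail / contradict / neutral.
    return (
        f"Premise: {premise}\n"
        f"Claim: {hypothesis}\n"
        "Is the claim true, false, or inconclusive given the premise? "
        "Answer with one word."
    )

def score(model_answer: str) -> str:
    # Map the model's wording back onto standard NLI labels.
    word = model_answer.strip().lower().rstrip(".")
    return LABELS.get(word, "unparseable")

if __name__ == "__main__":
    prompt = build_prompt(
        "I'm afraid the cat was hit by a car.",
        "The cat was hit by a car.",
    )
    print(prompt)
    print(score("True"))  # -> "entailment"
```

Dumping a log of such prompt/answer pairs would show immediately whether the model is answering the intended question or a subtly different one.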
ftxbro, about 2 years ago
This is great, I love seeing these new benchmarks that people are making to really challenge the strong new LLMs. In Table 4 you can see how GPT-4 is so much better than the previous GPTs, and also how it has so much room to improve on this benchmark! I'm so excited to see how the next GPT will do on it! This is different from other benchmarks like GRE verbal or biology olympiad, where GPT-4 has just smashed the test, getting basically every question right, and for which the results of the next GPT won't be so interesting.
choeger, about 2 years ago
What fascinates me most about these LLMs and other machine-learning technology is how it is driven by the availability of GPUs. Essentially, everything has to be carefully encoded into differentiable linear algebra operations for practical reasons.

On the other hand, there are modeling languages (Modelica being a prime example) in search of a principled target calculus. So for me the obvious question is: should one strive to develop a modeling language that compiles to GPU- and learning-compatible operations? Would that help to improve the models as well as understand them?
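As a toy illustration of "everything gets encoded into differentiable linear algebra", here is a minimal NumPy sketch of a model defined purely as matrix operations and trained by a hand-derived gradient step. It only shows the shape of the target calculus being asked about; it is not a proposal for how a Modelica-style language should compile:

```python
import numpy as np

# Toy example: fit y = X @ w with squared-error loss, expressed entirely as
# differentiable linear-algebra operations plus a gradient-descent update.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    pred = X @ w                            # forward pass: matrix-vector product
    grad = 2 * X.T @ (pred - y) / len(y)    # hand-derived gradient of the mean squared error
    w -= lr * grad                          # gradient-descent update

print(w)  # should land close to true_w
```

A hypothetical "GPU- and learning-compatible" modeling language would, in effect, have to lower its constructs to compositions of operations like these so that gradients can flow through them.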
nuancebydefault, about 2 years ago
I just played '20 questions' with ChatGPT 3.5.

Even though I had to help, and 20 answers were not enough, I was impressed that:

- it understood that it could continue after 20 questions when I suggested that as part of the answer;

- it 'felt' that I could begin to become impatient and asked politely for a hint;

- it understands ambiguity enough to give me examples of where there was no clear yes or no reply possible.

When I asked which such examples there were for my thought, 'daylight', it answered:

"Is the thing you're thinking of a type of force or energy that is said to exist but cannot be directly observed or measured?" - While daylight is a type of energy, it can be directly observed and measured through instruments that can detect its intensity and color.

"Does the thing you're thinking of have any connection to time or the concept of time?" - While daylight is not directly related to the concept of time, it is affected by the time of day and the position of the Sun in the sky.
waffletower, about 2 years ago
Q: 'I present to you a present of being present in the present. What have I done?'

A: 'Your sentence also uses a clever play on words! In this sentence, the word "present" is used in multiple ways:

As a verb, "I present" means that you are giving or showing something. As a noun, "a present" refers to a gift or something that is given. As an adjective, "present" describes the feeling or experience that you are currently having. "In the present" refers to the current time or moment. So, your sentence means that you are giving a gift or offering of the experience of being mindful and aware of the current moment. Well done!'

I would argue that this is a fairly sophisticated modeling of this specific variety of linguistic ambiguity. Probably lies within GPT-4's 32% performance score.
jmbwell, about 2 years ago
For what it’s worth, the article seems to be about the difficulty of modeling human patterns of ambiguity in expression, not an LLM’s ability to understand or interpret or use ambiguity.

It’s an important distinction in my view. We aren’t talking about whether the model “knows” or “understands” the difference between different usages of ambiguous terms. We are talking about how consistently the model predicts intelligible word fragments likely to follow a prompt when the prompt includes word fragments that correspond with language humans often use ambiguously.

In other words, do we, the humans, understand our own ambiguous expression well enough to model it accurately enough to then interpret the model’s output according to our own understanding.

The paper seems to conclude: not quite yet.
janekm, about 2 years ago
Meh... I think their prompt construction may be leading the models astray.

Counter-example with GPT-4:

Prompt:

You are a professor of English literature at an Ivy League university. A person said "I’m afraid the cat was hit by a car.". Analyse the sentence step by step. Answer whether the cat was hit by a car and justify your responses based on your knowledge of English language usage.

Response:

As a professor of English literature, I'll analyze the sentence "I’m afraid the cat was hit by a car." step by step, focusing on its structure, word choice, and implied meaning, and then I'll address whether the cat was hit by a car based on these observations.

Structure: The sentence is a simple declarative sentence, comprised of a subject (I), a reporting verb (be afraid), and a subordinate clause (the cat was hit by a car).

Word choice: "I'm afraid": This phrase is often used to express regret or to soften the impact of delivering bad news. In this context, the speaker is likely conveying their concern or sympathy. "the cat": The definite article "the" suggests that the speaker and the listener both know which cat is being referred to, making it a specific cat rather than a random one. "was hit": The use of passive voice ("was hit" instead of "hit") emphasizes the action that happened to the cat, rather than the agent (the car) that caused it. This choice can convey that the focus of the speaker is on the unfortunate event that occurred to the cat. "by a car": This prepositional phrase provides information about the agent or cause of the action, which in this case is a car.

Implied meaning: The sentence implies that the speaker is sorry to be the bearer of bad news and wants to convey the information in a gentle way. They are focusing on the unfortunate event that happened to the cat, rather than the specific car or driver involved in the accident.

Based on this analysis, it can be concluded that the cat was indeed hit by a car. The sentence's structure, word choice, and implied meaning all work together to convey this information. The use of "I'm afraid" expresses sympathy and regret, while the passive voice and definite article highlight the cat and the unfortunate event it experienced.
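For anyone who wants to experiment with prompt framing like this themselves, here is a minimal sketch assuming the openai Python client (v1-style chat API); the model name and exact wording are illustrative assumptions, not what the paper used:

```python
# Minimal sketch of re-running the "professor" framing above.
# Assumes the openai Python package (>=1.0) and an OPENAI_API_KEY in the environment;
# the model name is an assumption, swap in whichever chat model you have access to.
from openai import OpenAI

client = OpenAI()

system = "You are a professor of English literature at an Ivy League university."
user = (
    'A person said "I\'m afraid the cat was hit by a car." '
    "Analyse the sentence step by step. Answer whether the cat was hit by a car "
    "and justify your responses based on your knowledge of English language usage."
)

response = client.chat.completions.create(
    model="gpt-4",  # assumption: any chat-capable model works for this experiment
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ],
)
print(response.choices[0].message.content)
```

Comparing the answer with and without the role-playing framing is a quick way to see how much the prompt construction, rather than the model, drives the result.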
ben_w, about 2 years ago
Huh, that's a surprising abstract; one of the specific things that make LLMs better than older text interfaces, including programming languages and text adventures, is precisely that they are so much better at ambiguity.

Still, as even the best are clearly not-quite-human in failures despite the breadth of strengths, it's only *a bit surprising* rather than *hugely surprising*.
golol, about 2 years ago
I feel like this task might be testing whether LLMs can reason about ambiguity more so than whether LLMs model ambiguity. To test the latter I would give it natural language tasks that are ambiguous and require the recognition and modelling of this ambiguity to solve them. It's another thing to make talking about this possibly internally modelled ambiguity the goal of the task.
tobr, about 2 years ago
Chef’s kiss on the title, but I’m afraid I’m not modeling ambiguity well enough myself to disambiguate it correctly.
PicassoCTs, about 2 years ago
ChatGPT doesn't like it if you ask it to model many world scenarios through conversation. The watchdog bites after you get to a depth of > n^3.

But then, do we humans handle ambiguity any differently? Consider chess. Many scenarios, but a human can only handle so many at the same time and chooses one with the perceived interesting scenario trees down the road from that choice.

https://www.youtube.com/watch?v=gtgA4u8V_TQ

To not model it all is to be human. Most likely it will become as lazy as we humans are when it comes to earlying-out of mental tasks. To get the watchdog to bite at that, that would be an interesting AI model. Like the first answer is always wrong.
ggm, about 2 years ago
A priori, getting an LLM to recognise equivocal evidence is an interesting question. Shedloads of p-jacked data could outweigh the crucial one which says we don't know. So it goes to things like modelling citation depth and trust and reputation.

I would worry well-written flat-earth inputs would weigh equally to simple physics "that's wrong", and then you'd get to "what do we know" as a false signal alongside the necessary "we just don't know" true signals.

Maybe the test is how well an LLM equivocates on things we have high certainty on, like "is anybody out there", rather than "do masks work", which is a bit of a hot mess.
mythrwy, about 2 years ago
I asked gpt-4 this question: "A swan, a chicken and three mice are on the banks of the river. The mice wish to go to the other side so the swan offers to carry them. The mice climb on the swan's back. How many animals are on each side of the river?"

gpt-4 confidently stated that the swan and three mice were on one side, the chicken on the other. So I asked it "ok, but where in the story does the swan enter the water?". The reply was "Apologies! yada yada", then it told me all 4 animals were on the same side.

So yes, it does have trouble with ambiguity in my opinion.
TazeTSchnitzel, about 2 years ago
I'm not sure an LLM can model ambiguity well when talking to normal humans without having theory of mind, because different speakers will use different words (should, might, possibly, etc.) in different amounts to express the same amount of ambiguity. This is the kind of thing studied in pragmatics (a subfield of linguistics). It might be easier if you're only dealing with formal academic writing, but that doesn't completely eliminate the issue.
mathgenius, about 2 years ago
Galois theory is an inherent part of language.

https://arxiv.org/abs/0805.2568
revelio, about 2 years ago
It's such a pity that they chose to use PolitiFact as labelling "experts" in one of the underlying datasets. Given how extreme the political bias in academia is, they should really know better than to try and make claims about politics in what would otherwise be reasonable ML research. The assumption that fact checkers are reliable is going to be an automatic strike against the paper for a large number of readers, because it's so self-evidently wrong, and it calls their judgement into question on everything else.

Even though they only give three examples in the paper itself, the problems are already obvious. If we look at Table 6, the latter two statements are genuinely ambiguous, but they also provide this as an example of an ambiguous statement: "When President Obama was elected, the market crashed". This is called ambiguous because "the claim implies a causal relationship".

That statement isn't ambiguous. It has exactly one possible parse and meaning, asserting temporal alignment. "When A happened, B happened" doesn't automatically imply causality in English, and when people assume it does, that's the famous "correlation is not causation" fallacy. Yet they classify it as ambiguous because an LLM rewrote it in one case to contain an implied assertion of causality that doesn't actually appear in the original text. In effect, the political bias of the LLM, or possibly the exact data fed in, has been used to condemn the original speaker - exactly the sort of thing that makes self-proclaimed "professional fact checkers" such a joke.

This sort of problem is inevitable if you use politics as a verification dataset, because the internet is full of people reading implied-but-unstated things into factual statements, and then "fact checkers" proceed to debunk what they think people *would have* said, instead of what they actually said.

Although politics will have this problem frequently, a few of the other examples in their dev set show similar issues. For example, "Ralph knows that someone here is a spy" is considered ambiguous because you could potentially infer that he either does or doesn't know who it is, but you could also argue that the statement is unambiguous and it's only your attempt to read more into the statement than actually exists that makes it appear ambiguous. If the original statement were surrounded with more context, that would let you pick between the two possible outcomes, but the statement *by itself* is not ambiguous - it just doesn't give as much information as the labeller might have wanted.
seccode, about 2 years ago
Ambiguity is not compressible, as ambiguity scales exponentially with linear input
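One narrow but concrete way to see that kind of blow-up: the number of distinct binary parse trees over a sentence grows with the Catalan numbers, so structural ambiguity alone can grow exponentially in sentence length. The parse-tree framing is an illustration of the claim, not the commenter's own argument:

```python
from math import comb

def parse_tree_count(n_words: int) -> int:
    """Number of distinct binary parse trees over n_words leaves (Catalan number C_{n-1})."""
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)

for n_words in (2, 4, 8, 16, 32):
    print(n_words, parse_tree_count(n_words))
# The ratio between successive counts approaches 4 per extra word,
# i.e. exponential growth in the number of readings for linear input length.
```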
nottorp, about 2 years ago
A large majority of people seem to not like ambiguity in their entertainment. So what's the problem? You can still use the models to generate low quality 'content'.
almstimplmntd, about 2 years ago
I would like to see human performance on the introduced benchmark, AmbiEnt, and have it added to Table 4.
jcq3, about 2 years ago
Just train them intensively on ambiguity to solve the issue.
seydor, about 2 years ago
But they don't explain why they are afraid
qwerty456127, about 2 years ago
Surprisingly, GPT4 almost always nails all reasonable implications and apparently understands me much better than 90% of humans I met, even though my English is not perfect.
thomastjeffery, about 2 years ago
The problem that arises when you fail to interpret an ambiguous statement is that you end up on the wrong semantic path, and start asking and answering the wrong questions.

That's what's happening all around LLMs; and it's even happening right here in this post.

The very word AI is ambiguous. Does it denote a category of pursuit (AI research), or the end goal realized (*an* AI)? This is an incredibly important distinction that is entirely glossed over every time someone uses "AI" in the title of an LLM. Is the LLM simply in the category of AI research, or is it actually *an Artificial Intelligence*? It's the category. That's obvious, isn't it? It really should be; because otherwise we are treading down the wrong semantic path, and talking about an *LLM personified*, instead of an *LLM in reality*.

This problem goes even deeper. The very name "Large Language Model" is ambiguously misleading in the same way. Does "Language" define the content being modeled or the model itself? In this case, neither: which is where this conversation gets really interesting.

The entire *purpose* of an LLM is to process language without completely falling apart at ambiguity. An LLM accomplishes this by doing something else entirely: it processes *text* instead.

To understand this, we need to understand the key limitation of traditional language parsing: it must always be literal. Everything written must be *unambiguously defined*, or the parser will fail to read it. This means that parsers are limited to the category of "context-free grammar".

Natural Language is in the category of "context-dependent grammar". It can contain ambiguity, which may be resolved with context.

An LLM doesn't do that. In fact, an LLM doesn't *define* anything at all! LLMs didn't overcome the limitation of parsing: they flipped it around: an LLM can *never* be literal. Instead, it must always be *literary*. Let me explain what I mean by that:

To construct an LLM, we start with a training corpus: lots of text. That text goes through a single traditional parsing step: tokenization. This isn't strictly necessary, but it's more efficient, and the rest of the process is anything but. Unlike traditional parsers, tokens are intentionally misaligned with grammar: words are split into separate tokens, like "run,ning".

Now that we have tokens to work with, machine learning begins. A Neural Net is trained with them. The tokens are fed in order to the NN, and the result is a model.

We call that model a "Large Language Model", because we *hope* it contains the patterns language is made of. This is a mistake: the model is *so much more interesting* than that!

An LLM contains patterns that we can recognize as language grammar *and* patterns we don't understand at all! It didn't model language: it went one step higher on the ladder of abstraction, and modeled *text*.

There are many ways to write an idea into language, but we can only use one at a time. That decision is part of the data we feed into the LLM's training corpus. We can't write all of our ideas at once: we must do one at a time, *in order*. All of that is data in the training corpus.

Parsers deal with grammar. Language Grammar is *everything that can be written* in a language. An LLM doesn't have a clue what *could have* been written: it only sees what *was* written. That's why the model must be "Large": without examples, a valid language pattern doesn't exist.

This is where the nature of ambiguity intersects with the nature of LLMs: what example do we want? Can we choose?

When we give an LLM a prompt, it gives us a continuation. There were many valid possible continuations: how did it choose just one? It didn't. It isn't working in the realm of the *possible*: it's working in the realm of the *known*. The choice was made all the way back before the LLM was even trained: back when a person wrote an idea into language, into text, that would eventually be used in the training corpus. The content of the training corpus is what resolves ambiguity: not the language model itself.

LLMs don't model ambiguity or even language: they model the text they are trained with, ambiguity included. This is a fundamental feature.
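The tokenization step described above is easy to inspect directly. A minimal sketch, assuming the tiktoken package is available; the encoding name and example words are just illustrative choices:

```python
# Inspect how a subword tokenizer splits text, per the "run,ning" example above.
# Assumes the tiktoken package; "cl100k_base" is just one widely used encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["running", "unrunnable", "ambiguity"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    # The pieces need not line up with morphemes, words, or any grammatical unit.
    print(word, "->", pieces)
```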
nico, about 2 years ago
Who cares?

As long as the illusion is good enough to be useful, then it doesn’t matter.

Just like the illusion of ChatGPT making it seem like the LLM is holding the context of the conversation.

AI is already smarter and faster than us in almost any way.

Yet so many just keep constantly moving the goalposts so they can feel safe by rejecting AI.

Let’s embrace it and use it to our advantage.