ChatGPT has trouble giving an answer before explaining its reasoning

142 points by foobuzz about 2 years ago

15 comments

swatcoder, about 2 years ago

People get so distracted trying to use certain significant words for what LLMs do, even when the usage is strained and makes it harder to see how they actually work and what they excel at.

A better word for what they do here might be something like “preambulating” — it develops a focus to its later output by grounding more and more tokens into its active context, because they each narrow what else fits. That winnowing effect helps it produce a coherent and rich answer, and when you undermine its opportunity to use that technique, the answers become less coherent and more random.

This is not reasoning as that word is traditionally used and doesn’t need to be called that.

Yet it’s still a fascinating emergent phenomenon with incredible engineering opportunity. When you call it by something less culturally ambitious and more technically precise, it helps you stay focused on how to use it well and less distracted by some personal desire to prove this is the exact historical moment you want it to be.

We need to develop a better vocabulary around these things if we want to stop having the dumb Nascent AGI vs Fancy Autocomplete flamewar.

Edit: And I’ll even throw a bone to the Nascent AGI people and say that this kind of preambulating is absolutely something that people do too, and easy to characterize as some form of intelligence. But it’s not *reasoning*, which has specific strong connotations of formality and logic, which don’t hold well with these particular tools.
habitue, about 2 years ago

> ChatGPT cannot give an answer that is the result of a "reasoning" before laying out the "reasoning".

This is slightly too strong a statement: it *can* give an answer before reasoning it out, but it only gets a single forward pass of the network to calculate that answer, so it has to be a simpler kind of answer or very obvious reasoning (intuitively, imagine it can only take "one logical step" in a forward pass).

Its answers get much better if it uses the context as a scratch pad to write down its thinking from previous passes; this is where Chain of Thought (CoT) comes in. The way language models work is they pass the output to the input over and over, each time generating one token. This means the context is really like a scratch pad recording its previous thoughts.
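(A minimal sketch of that loop in Python; `next_token` is a hypothetical stand-in for one forward pass of a model, and the only point is that every generated token is appended to the context before the next pass, which is what makes the context usable as a scratch pad.)

```python
def next_token(context: str) -> str:
    """Hypothetical stand-in for one forward pass of the model:
    score the vocabulary given the context and return the next token."""
    raise NotImplementedError

def generate(prompt: str, max_tokens: int = 256, stop: str = "<|end|>") -> str:
    # The context starts as the prompt and grows by one token per pass,
    # so earlier output becomes input for later passes (the "scratch pad").
    context = prompt
    for _ in range(max_tokens):
        token = next_token(context)  # a single pass of "thinking" per token
        if token == stop:
            break
        context += token
    return context[len(prompt):]
```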
civilized, about 2 years ago

On the "pound of potatoes vs pound of feathers" question - ChatGPT's explanation is so convincing that it seems to really understand the concepts. But it can be juked very easily.

I asked it what was heavier, 1,000 tons of helium or a cubic meter of osmium, and it told me that the helium weighs 1,000,000 kg, the osmium 22,600 kg, and therefore the osmium weighs more because it's way denser.

It's hard for me to believe that these LLMs are going to learn basic logic if only we throw more training data at them.

Transcript:

Me: What is heavier, 1000 tons of helium or a cubic meter of osmium?

ChatGPT: A cubic meter of osmium is much heavier than 1000 tons of helium.

One ton is equal to 1000 kilograms, so 1000 tons is equal to 1,000,000 kilograms. The density of helium is about 0.1785 kilograms per cubic meter. Therefore, 1000 tons of helium would have a volume of:

1,000,000 kg / 0.1785 kg/m³ = 5,604,069.77 m³

On the other hand, osmium is one of the densest elements known, with a density of around 22,600 kg/m³. Therefore, a cubic meter of osmium weighs:

1 m³ x 22,600 kg/m³ = 22,600 kg

So, a cubic meter of osmium weighs much more than 1000 tons of helium.

Update: I am continuing the conversation and I am now being explicitly told that, while 1,000,000 kg is much heavier than 22,600 kg, it doesn't change the fact that the osmium is heavier than the helium because the osmium is denser.

Update 2: I then reminded it about the potatoes and feathers and how density was irrelevant in that context, and shouldn't it therefore be irrelevant in the case of the helium and the osmium? And instead of correcting its response on the helium and osmium, it's now telling me the feathers and potatoes weigh different amounts.

Update 3: It is now telling me that densities don't matter when comparing masses but do matter when comparing weights. I must say, it has a certain panache in resolving internal inconsistencies in its past responses.

Update 4: After being corrected half a dozen times with contradictory information, I asked it to state its confidence in its latest story. It said "I can state with a high degree of confidence that my last answer was accurate". The shamelessness!
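(The comparison from the transcript, checked in Python; the densities are the ones ChatGPT itself quoted, roughly 0.1785 kg/m³ for helium and 22,600 kg/m³ for osmium.)

```python
# Density is irrelevant to the weight comparison; only the masses matter.
helium_mass_kg = 1000 * 1000     # 1000 metric tons = 1,000,000 kg
osmium_mass_kg = 1 * 22_600      # one cubic meter at ~22,600 kg/m³

print(helium_mass_kg > osmium_mass_kg)   # True: the helium is heavier
print(helium_mass_kg / 0.1785)           # ~5.6 million m³, the volume the helium would occupy
```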
ndnichols, about 2 years ago

The challenge here is that ChatGPT and other LLMs can only think out loud. They only "think" through writing, and that's always displayed to the user.

Has anyone tried giving LLMs a scratchpad where the model could e.g. run the pipeline in order, generate the poem, and then explicitly publish it to the user without showing the earlier steps?
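(One way to approximate that scratchpad idea with a plain completion model, as a sketch; `complete` is a hypothetical call into any text-completion API, and the delimiters and marker are arbitrary. The model still thinks out loud, but the caller hides the thinking.)

```python
def complete(prompt: str) -> str:
    """Hypothetical call into any text-completion model."""
    raise NotImplementedError

TEMPLATE = (
    "Think through the task step by step between <scratchpad> and </scratchpad>. "
    "Then write only the finished result after the line FINAL ANSWER:\n\n"
    "Task: {task}\n"
)

def answer_without_showing_work(task: str) -> str:
    raw = complete(TEMPLATE.format(task=task))
    # Discard the scratchpad and surface only what follows the marker.
    return raw.split("FINAL ANSWER:", 1)[-1].strip()
```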
ugurnot, about 2 years ago

I evaluated ChatGPT on the Winogrande Debiased validation set [1], a dataset focused on commonsense reasoning. ChatGPT has an accuracy of 62.75%, below GPT-3's reported accuracy of 77.7%.

[1] https://github.com/ugorsahin/Winogrande_ChatGPT
jagraff, about 2 years ago

The most interesting part is that the author can "coerce" GPT into giving a completely opposite answer by requiring the first token to be Yes or No, and the ways that sometimes it skirts around that without breaking the rule.
kybernetikos, about 2 years ago

There was an interesting comment a while back about the problem of generating "a" or "an" correctly for a token generator. In order to do so, you have to predict what you'll generate next. Smaller models get this wrong. Even ChatGPT, which doesn't get this wrong, has limits on its ability to look ahead into its own likely output. I suspect that this is just a difficult task for a token generator, and fixing it naturally requires a much bigger model.

All these hacks that fix problems by maintaining a "train of thought" are fascinating though, given that we seem to have evolved a similar hack.
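(A toy illustration of why the article depends on a word that has not been generated yet, which is exactly what a strictly left-to-right generator must anticipate; the heuristic below is deliberately naive.)

```python
def article_for(word: str) -> str:
    # Crude spelling heuristic; real usage depends on pronunciation
    # ("an hour", "a unicorn"), which only makes the lookahead harder.
    return "an" if word[:1].lower() in "aeiou" else "a"

for noun in ["apple", "banana", "hour", "unicorn"]:
    print(article_for(noun), noun)
    # "an apple", "a banana", then "a hour" and "an unicorn" (both wrong)
```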
golol, about 2 years ago

This is exactly what I talked about in this post: https://news.ycombinator.com/item?id=34445896

The reduced version is that decoder-only transformer LLMs cannot generate a hash of a random animal name followed by the animal name; they can only generate a random animal name followed by its hash (assuming the LLM is powerful enough to compute hashes correctly in one forward pass in the first place).
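(A small illustration of that ordering constraint, using SHA-256 only as a concrete stand-in for the hash in the comment.)

```python
import hashlib
import random

animals = ["otter", "heron", "lynx"]

# Easy order for a left-to-right generator: commit to the animal first; the
# hash is then a deterministic function of tokens already in the context.
animal = random.choice(animals)
digest = hashlib.sha256(animal.encode()).hexdigest()
print(animal, digest)

# The reverse order is the hard one: to emit the digest first, the generator
# would already have to have decided on the animal that only appears later.
```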
axiom92, about 2 years ago

We did some work exploring why spelling out the rationale before the answer works so well!

Talk: https://madaan.github.io/res/presentations/TwoToTango.pdf

Paper: https://arxiv.org/pdf/2209.07686.pdf
KyeRussell, about 2 years ago
Really this is where you’re better off just jumping to GPT-3. OpenAI has obviously now muddied the waters with the Chat API, let alone making it so damn cheap. But ChatGPT has been tuned to be conversational and verbose. My experience has been that getting what you want by raw-dogging GPT-3 is much more fruitful.
mrjin, about 2 years ago

We don't understand how we understand. Then how can we expect something we created to understand?
cleanchit, about 2 years ago

You would too. You just don't speak it out loud.
lwhi, about 2 years ago
Could this phenomenon be avoided with the addition of another prompt asking it to take account of the discrepancy?
dwohnitmok, about 2 years ago

This is a good thing for keeping some semblance of explainability for an AI.

Otherwise you have a true black box.
senectus1, about 2 years ago

I wish they would tweak it for honesty.

The damn thing lies entirely too easily, then happily sticks with the lie.

ChatGPT, if you don't know the answer then tell me that; "I don't know" is an acceptable answer.