
Recent results show that LLMs struggle with compositional tasks

345 points | by marban | 3 months ago

27 comments

moolimon, 3 months ago
The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they are doing pattern matching on their input data and short-circuiting to a statistically likely result. Chain-of-thought reasoning is still bound by this basic property of reflexive pattern matching, except the LLM is forced to go through a process of iteratively refining the domain it does matching on.

Chain of thought is interesting because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. This comes with the caveat that you need some reward model for all RL. This means you need a clear definition of success, and some way of rewarding being closer to success, to actually solve those problems.

Framing transformer-based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem-solving skills too. It is interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
antirez, 3 months ago
LLMs keep showing more and more that they are the wonder of AI we awaited for decades: talking machines that every two months make progress that two months earlier was declared impossible because of <put here some limit that actually lived in the prejudice of the skeptical AI community> (just stochastic parrots, no reasoning possible without symbolic representations, there are no longer tokens, ...).

At the same time, part of the scientific community continues to diminish what has been accomplished and the steps that are being made. A few months ago LeCun went so far as to tell new researchers to move away from LLMs since they are a dead end: imagine the disservice he did to the surely non-zero number of folks who followed the advice, putting themselves outside the AI research that matters. (Incidentally, this skepticism from the Meta AI head must have something to do with the fact that Meta, despite the huge efforts allocated, produced the weakest LLM among Anthropic, OpenAI, and DeepSeek -- I bet Zuckerberg is asking questions lately.)

It's very hard to explain this behavior other than as psychological denial.

[EDIT: you can't see the score of this comment, but I can: it's incredible how it goes from 3, to -2, to 1, and so forth. The community is split in two, and it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth]
starchild3001, 3 months ago
What a poorly informed article. It's very shallow and out of touch with LLM research. As it stands, 6-12 month old models are System 1 thinkers; everybody knows this and knew it even at the time. You need System 2 thinking (test-time compute) for more complex logical, algorithmic, and reasoning tasks. We knew this when Daniel Kahneman wrote Thinking, Fast and Slow (over a decade ago), and we still know it today. So LLMs can think, but they have to be programmed to think (a la System 2, reasoning, thinking models). There's nothing inherently wrong or limited with LLMs themselves as far as we can tell.
geoffhill, 3 months ago
Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:

    solve(WaterDrinker, ZebraOwner) :-
        % H01: Five houses with positions 1..5.
        Houses = [
            house(1, _, norwegian, _, _, _),  % H10: Norwegian lives in the first house.
            house(2, blue, _, _, _, _),       % H15: Since the Norwegian lives next to the blue house,
            house(3, _, _, milk, _, _),       %      and house1 is Norwegian, house2 must be blue.
            house(4, _, _, _, _, _),
            house(5, _, _, _, _, _)
        ],
        % H02: The Englishman lives in the red house.
        member(house(_, red, englishman, _, _, _), Houses),
        % H03: The Spaniard owns the dog.
        member(house(_, _, spaniard, _, dog, _), Houses),
        % H04: Coffee is drunk in the green house.
        member(house(_, green, _, coffee, _, _), Houses),
        % H05: The Ukrainian drinks tea.
        member(house(_, _, ukrainian, tea, _, _), Houses),
        % H06: The green house is immediately to the right of the ivory house.
        right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
        % H07: The Old Gold smoker owns snails.
        member(house(_, _, _, _, snails, old_gold), Houses),
        % H08: Kools are smoked in the yellow house.
        member(house(_, yellow, _, _, _, kools), Houses),
        % H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
        next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
        % H12: Kools are smoked in a house next to the house where the horse is kept.
        next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
        % H13: The Lucky Strike smoker drinks orange juice.
        member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
        % H14: The Japanese smokes Parliaments.
        member(house(_, _, japanese, _, _, parliaments), Houses),
        % (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
        % Finally, find out:
        % Q1: Who drinks water?
        member(house(_, _, WaterDrinker, water, _, _), Houses),
        % Q2: Who owns the zebra?
        member(house(_, _, ZebraOwner, _, zebra, _), Houses).

    right_of(Right, Left, Houses) :- nextto(Left, Right, Houses).

    next_to(X, Y, Houses) :- nextto(X, Y, Houses); nextto(Y, X, Houses).

Seems ok to me.

    ?- solve(WaterDrinker, ZebraOwner).
    WaterDrinker = norwegian,
    ZebraOwner = japanese .
mmcnl, 3 months ago
There's so much talk about the advancements in AI/LLMs, yet for me ChatGPT as of this date is basically just a faster search engine without cookie banners, clickbait, and ads. It hallucinates a lot and it can keep only very limited context. Why is there so much promise about future progress but so little actual progress?
mikeknoop, 3 months ago
One must now ask whether research results are analyzing pure LLMs (e.g. the gpt series) or LLM synthesis engines (e.g. the o series, r series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on new synthesis engines. In fact, evidence strongly suggests the opposite, given o3's significant performance on ARC-AGI-1, which requires on-the-fly composition capability.
kebsup, 3 months ago
I've managed to get LLMs to fail on simple questions that require thinking graphically, in 2D or 3D.

An example would be: you have an NxM grid. How many copies of shape XYZ can you fit on it?

However, thinking of transformer-based video games, AI can be trained to have a good representation of 2D/3D worlds. I wonder how this could be combined so that the graphical representation is used to compute text output.
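As a concrete, hedged illustration of that kind of question (the 2x1 domino shape and the brute-force helpers below are illustrative choices, not from the comment), the underlying combinatorics of such a grid puzzle is tiny even when models get it wrong:

    from itertools import combinations

    def domino_placements(n, m):
        """All 2x1 domino placements on an n x m grid, as frozensets of two cells."""
        placements = []
        for r in range(n):
            for c in range(m):
                if c + 1 < m:
                    placements.append(frozenset({(r, c), (r, c + 1)}))  # horizontal
                if r + 1 < n:
                    placements.append(frozenset({(r, c), (r + 1, c)}))  # vertical
        return placements

    def max_packing(n, m):
        """Largest number of non-overlapping dominoes (exhaustive; tiny grids only)."""
        placements = domino_placements(n, m)
        for k in range(min(len(placements), (n * m) // 2), 0, -1):
            for combo in combinations(placements, k):
                covered = set().union(*combo)
                if len(covered) == 2 * k:   # no two dominoes share a cell
                    return k
        return 0

    print(max_packing(3, 3))  # -> 4
    print(max_packing(3, 4))  # -> 6 (a 3x4 board tiles perfectly)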
klodolph, 3 months ago
When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.<p>These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.
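To make the "gluing LLMs to SAT solvers" idea concrete, here is a toy sketch of the split: the model only translates a puzzle into clauses, and a solver does the actual search. The clauses below are hand-written stand-ins for what a model might emit, and a real pipeline would hand them to a proper SAT solver rather than this exhaustive check:

    from itertools import product

    # Imagine an LLM has translated a puzzle into CNF clauses over variables 1..n,
    # with a negative integer meaning "not that variable".
    clauses = [[1, 2], [-1, 3], [-2, -3]]   # hypothetical model output
    n_vars = 3

    def satisfying_assignments(clauses, n_vars):
        """Exhaustive check; stands in for a call to a real SAT solver."""
        for bits in product([False, True], repeat=n_vars):
            assignment = {i + 1: bits[i] for i in range(n_vars)}
            if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                   for clause in clauses):
                yield assignment

    for model in satisfying_assignments(clauses, n_vars):
        print(model)   # the solver, not the LLM, does the actual search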
changoplatanero, 3 months ago
By the time these academic studies get published they are usually already several months out of date. o3-mini was released yesterday, and if one wants to know about the limitations of current technology they are much better off checking Twitter than some research paper.
mohsen1, 3 months ago
https://chatgpt.com/share/679f0353-12bc-8007-91cf-dd63d52044a0

O1 Pro gets the answer with proper reasoning.

Can't tell if this is due to data contamination or it really figured it out.

How can we phrase the question in another way to avoid data contamination?
Workaccount2, 3 months ago
"Recent results" = a two-year-old study looking at GPT-3, 3.5, and first-gen 4.
WillAdams, 3 months ago
Perhaps applying:

https://www.doc.ic.ac.uk/~re14/Evans-R-2020-PhD-Thesis.pdf

"Kant's Cognitive Architecture"

will help?

> ...we provide a precise formalization of what it means to "make sense" of a sensory sequence. According to our definition, making sense means constructing a symbolic causal theory that explains the sensory sequence and satisfies a set of unity conditions that were inspired by Kant's discussion in the first half of the Critique of Pure Reason. According to our interpretation, making sense of sensory input is a type of program synthesis, but it is unsupervised program synthesis.
comeonbro, 3 months ago
> Chatbot Software Begins to Face Fundamental Limitations

> Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.

Your first question with anything like this should always be *WHICH MODELS*:

> For our experiments, we evaluate the performance of 6 LLMs: GPT4 (gpt-4) [58], ChatGPT (GPT3.5-turbo) [57], GPT3 (text-davinci-003) [11], FlanT5 [17] and LLaMa [75].

This is *ancient*. This research was done *centuries* ago. This is research about the possibility of isotopes, written about radium in 1903, published in 1946. It is a criminal level of journalistic malpractice to leave uninformed readers with the impression that this is where AI stood *yesterday*.
malaigo, 3 months ago
I'd like to propose a modified Zebra puzzle. It still has unique solutions. It's not on the internet, so it's not part of the training set. Similar problem-description complexity. Just remove the last constraint (the Norwegian lives next to the blue house) and ask negated questions: Who cannot be the water drinker? Who cannot own the zebra? Text of the problem here: sandiway.arizona.edu/mzebra.pdf (I tested it on free ChatGPT and DeepSeek-R1: https://youtu.be/3gotauWUcew)
anon291, 3 months ago
This is ultimately a basic adaptation of the pigeonhole principle and is not surprising. A finite system of matrix multiplications cannot be Turing complete. You cannot expect one trip through a series of matrix multiplications and bias additions, with a final sampling at the end that commits it to a certain answer, to ever produce a correct answer. It's a mathematical impossibility. No talk of quantum woo, emergent phenomena, or whatever other pseudo-science has arisen to explain AI intelligence can get around this simple truth of mathematics.

However, chain-of-thought reasoning, where token streams can continue ad infinitum, could potentially solve large swaths of problems whose ordinary solutions require Turing machines. It could also solve problems that cannot generally be solved by Turing machines, but where you only need solutions for a few classes of inputs.

Either way, even with chain of thought, you would expect that in some instances the model output diverges and does not complete. And unsurprisingly, this is exactly what you see with the DeepSeek models (and other CoT models) when you pose them difficult questions: they never emit the </think> tag.
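A toy sketch of the "one trip through" point (sizes, depth, and the plain ReLU stack are arbitrary illustrations; real transformers also have attention, which this ignores): the amount of computation per emitted token is fixed ahead of time, regardless of how hard the question is.

    import numpy as np

    rng = np.random.default_rng(0)
    d, vocab, depth = 64, 1000, 4   # illustrative sizes, not a real model

    # A fixed stack of (matmul + bias + nonlinearity) layers.
    layers = [(rng.normal(size=(d, d)) / np.sqrt(d), rng.normal(size=d))
              for _ in range(depth)]
    unembed = rng.normal(size=(d, vocab))

    def forward(x):
        h = x
        for W, b in layers:            # constant number of layers...
            h = np.maximum(h @ W + b, 0.0)
        logits = h @ unembed
        return int(np.argmax(logits))  # ...then one commitment to a token

    print(forward(rng.normal(size=d)))  # same amount of work for any input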
emorning3, 3 months ago
I'm a software developer with just a basic understanding of LLMs.

But it's not surprising to me that LLMs wouldn't be good at composition and planning. To me, LLMs seem like half a brain, the right half, that does pattern matching and such.

The current crop of AI seems to be missing the other half of the brain, the part that's good at planning and composition.

LLMs can tell me all I want to know about logic and deduction, so it's funny that LLMs can't understand their own words.
d0mine, 3 months ago
> multilayer transformers indeed cannot solve certain complicated compositional tasks

> chain-of-thought prompting essentially turns a large problem into a sequence of smaller problems, making it possible for transformers to tackle more complex compositional tasks

--- [quotes out of order]

> the model could be trained on 20-digit numbers and still reliably (with 98% accuracy) add 100-digit numbers, whereas a model trained without the extra positional embedding was only about 3% accurate
EGreg, 3 months ago
Maybe they're not training them on the right thing; maybe math could be a different "mode", just like sound or images.
have-a-break, 3 months ago
Maybe it's just me, but when I code/program I typically think about how different implementation choices would affect the ROI of the company I'm working for.

That's something that AI in its current state probably wouldn't be able to take into account.
anshumankmr, 3 months ago
Isn't the math-calculation issue solvable through integration with a simple Python script for basic math, and something like an open-source version of Wolfram for the more complex stuff?
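That offloading usually takes roughly the following shape; the routing and the safe_eval helper here are a hypothetical sketch, not any particular vendor's tool-calling API. The model only has to emit the expression, and the host evaluates it exactly:

    import ast
    import operator

    _OPS = {
        ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg,
    }

    def safe_eval(expr):
        """Evaluate +, -, *, /, ** over numeric literals only (no names, no calls)."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](walk(node.operand))
            raise ValueError(f"unsupported expression: {expr!r}")
        return walk(ast.parse(expr, mode="eval"))

    # The "LLM" only needs to produce the expression string; the arithmetic is exact.
    print(safe_eval("123456789 * 987654321 + 42"))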
seydor, 3 months ago
"Fundamental" results cannot come from empirical findings; they have to be proven rigorously. Until then, they are not fundamental.
simonw, 3 months ago
Here's "Einstein's puzzle" from this paper: https://www.researchgate.net/publication/341189675_Is_Einstein's_Puzzle_Over-Specified

    H01 There are five houses.
    H02 The Englishman lives in the red house.
    H03 The Spaniard owns the dog.
    H04 Coffee is drunk in the green house.
    H05 The Ukrainian drinks tea.
    H06 The green house is immediately to the right of the ivory house.
    H07 The Old Gold smoker owns snails.
    H08 Kools are smoked in the yellow house.
    H09 Milk is drunk in the middle house.
    H10 The Norwegian lives in the first house.
    H11 The man who smokes Chesterfields lives in the house next to the man with the fox.
    H12 Kools are smoked in a house next to the house where the horse is kept.
    H13 The Lucky Strike smoker drinks orange juice.
    H14 The Japanese smokes Parliaments.
    H15 The Norwegian lives next to the blue house.

    Now,
    Q1 Who drinks water?
    Q2 Who owns the zebra?
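For comparison with the Prolog program above, here is a minimal pure-Python brute-force sketch of the same fifteen constraints (the helper names and loop ordering are my own; it assumes houses are numbered 1-5 from left to right):

    from itertools import permutations

    HOUSES = (1, 2, 3, 4, 5)          # positions, numbered left to right

    def immediately_right_of(a, b):
        return a - b == 1

    def next_to(a, b):
        return abs(a - b) == 1

    def solve():
        for red, green, ivory, yellow, blue in permutations(HOUSES):
            if not immediately_right_of(green, ivory):                      # H06
                continue
            for english, spaniard, ukrainian, norwegian, japanese in permutations(HOUSES):
                if english != red or norwegian != 1:                        # H02, H10
                    continue
                if not next_to(norwegian, blue):                            # H15
                    continue
                for coffee, tea, milk, orange_juice, water in permutations(HOUSES):
                    if coffee != green or ukrainian != tea or milk != 3:    # H04, H05, H09
                        continue
                    for old_gold, kools, chesterfields, lucky_strike, parliaments in permutations(HOUSES):
                        if kools != yellow or lucky_strike != orange_juice: # H08, H13
                            continue
                        if japanese != parliaments:                         # H14
                            continue
                        for dog, snails, fox, horse, zebra in permutations(HOUSES):
                            if spaniard != dog or old_gold != snails:       # H03, H07
                                continue
                            if not next_to(chesterfields, fox):             # H11
                                continue
                            if not next_to(kools, horse):                   # H12
                                continue
                            owner = {english: "Englishman", spaniard: "Spaniard",
                                     ukrainian: "Ukrainian", norwegian: "Norwegian",
                                     japanese: "Japanese"}
                            return owner[water], owner[zebra]

    water_drinker, zebra_owner = solve()
    print(water_drinker, "drinks water;", zebra_owner, "owns the zebra")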
apsec112, 3 months ago
"showing that multilayer transformers indeed cannot solve certain complicated compositional tasks. Basically, some compositional problems will always be beyond the ability of transformer-based LLMs."

Pretty sure this is just false and the paper doesn't show this. I could be misunderstanding, but it looks like the result is only about a single token/forward pass, not a reasoning model with many thousands of tokens like o1/o3.
strangescript, 3 months ago
Why do articles testing ancient LLMs keep getting posted?
Ndv_MC, 3 months ago
Quite deep and informative. One thing to point out here is how they measure accuracy when testing their LLMs. Like most neural networks, the n-dimensional space of LLMs is largely sparse. In simple terms, this means that it’s very easy to “exercise” the model in areas where it’s weak. That doesn’t mean that the model doesn’t have the potential of doing a (much) better job if you “exercise” its space properly—which is exactly what techniques like CoT basically do. LLMs are inherently limited, and they don’t have “true” reasoning capabilities—beyond marketing hype, this much should be obvious to most serious ML researchers today. The only question is how well we can get them to “mimic” reasoning for practical applications, and this is where “prompt engineering”, strictly speaking, is a true form of engineering, which has to take into account the mathematical foundations of the models and figure out how to extract out of them the best performance they can deliver.
novemp, 3 months ago
It feels like every few weeks someone puts out a new article breathlessly talking about how bad the fancy autocorrect bot is at math and logic, as if it were brand-new information. How have we as a society not gotten this through our skulls yet?
MagicMoonlight, 3 months ago
Do you really need a study to work out that a Markov chain can't solve problem questions? It feels pretty intuitive.

LLMs are not intelligent. Human intelligence is stored in text, so mimicking text mimics that intelligence. The sheer quantity of text means they've probably seen something similar to the prompt.

If LLMs were an intelligent system they wouldn't need to steal 1000 TB of text and media content. Have you ever seen a person who required a million books? It shows that the learning is brute force rather than intelligent.

It's close, but it's just not there. The methodology is wrong.