I find their critique compelling, particularly their emphasis on the disconnect between CoT's algorithmic mimicry and true cognitive exploration. The authors illustrate this with examples from advanced mathematics, such as the "windmill problem" from the International Mathematical Olympiad, a puzzle whose solution eludes brute-force sequential thinking. These cases underscore the limits of a framework that relies on static datasets and rigid generative processes. CoT, as they demonstrate, falters not because it cannot generate solutions, but because it cannot conceive of them in ways that mirror human ingenuity.

As they put it: "Superintelligence isn't about discovering new things; it's about discovering new ways to discover."
This is the big idea in the paper: CoT is limited on a certain class of complex problems, the ones where there is no 'textbook' way to find a solution. These are novel problems that need a unique methodology. "Essentially, to start generating the solution requires that we already know the full approach. The underlying generative process of the solution is not auto-regressive from left-to-right."

Mathematical meaning:

We can formalize this argument through the interpretation of reasoning as a latent variable process (Phan et al., 2023). In particular, classical CoT can be viewed as

  p(a | q) = Σ_s p(a | s, q) p(s | q),   where s = (s_1, ..., s_n) is a latent reasoning chain,

i.e., the probability of the final answer is produced by a marginalization over latent reasoning chains.

We claim that for complex problems, the true solution generating process should be viewed as

  p(a, s_1, ..., s_n | q) = Σ_z p(a, s_1, ..., s_n | z, q) p(z | q),

i.e., the joint probability distribution of the solution (a, s_1, ..., s_n) is conditioned on the latent generative process z. Notice that this argument is a meta-generalization of the prior CoT argument, hence why we will refer to the process q → z_1 → ... → z_m as Meta-CoT.

I think this is seminal. It is getting at the heart of some issues. Ask o1-pro how you could make a 1550 nm laser diode operating at 1 GHz have low geometric loss without an expensive collimator, using commodity materials or novel manufacturing approaches grounded in first-principles physics, and the illusion that o1-pro is a big deal is lost. 'Novel' engineering is out of reach because there is no textbook on how to do novel engineering, and this class of problems is 'not auto-regressive from left-to-right'.
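Since the equations didn't survive the copy-paste, here is a tiny numerical sketch of the difference between the two factorizations as I read them. The distributions, step names, and "processes" below are entirely made up for illustration; none of it is from the paper.

    # Classical CoT:  p(a | q) = sum_s p(a | s, q) * p(s | q)
    # Meta-CoT:       p(a, s | q) = sum_z p(a, s | z, q) * p(z | q)
    # Tiny discrete spaces so both marginalizations can be enumerated exactly.

    STEPS = ["expand", "factor"]        # a one-step stand-in for the chain s
    ANSWERS = ["right", "wrong"]
    PROCESSES = ["search", "greedy"]    # stand-ins for the latent process z

    # Classical CoT: the chain s is sampled, then the answer given the chain.
    p_s = {"expand": 0.6, "factor": 0.4}
    p_a_given_s = {
        ("right", "expand"): 0.9, ("wrong", "expand"): 0.1,
        ("right", "factor"): 0.3, ("wrong", "factor"): 0.7,
    }
    p_a = {a: sum(p_a_given_s[a, s] * p_s[s] for s in STEPS) for a in ANSWERS}
    print("classical CoT, p(a|q):", p_a)

    # Meta-CoT: the whole visible solution (s, a) is generated jointly,
    # conditioned on a latent process z that decides *how* to look for it.
    p_z = {"search": 0.5, "greedy": 0.5}
    p_as_given_z = {
        ("right", "expand", "search"): 0.5, ("right", "factor", "search"): 0.3,
        ("wrong", "expand", "search"): 0.1, ("wrong", "factor", "search"): 0.1,
        ("right", "expand", "greedy"): 0.2, ("right", "factor", "greedy"): 0.1,
        ("wrong", "expand", "greedy"): 0.4, ("wrong", "factor", "greedy"): 0.3,
    }
    p_as = {(a, s): sum(p_as_given_z[a, s, z] * p_z[z] for z in PROCESSES)
            for a in ANSWERS for s in STEPS}
    print("Meta-CoT, p(a, s|q):", p_as)

The toy is only meant to show the structural difference: in the second factorization the model first commits to a latent process z ("how to search") and only then emits the visible chain and answer, which is the sense in which the solution is not generated left-to-right.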
> That is, language models learn the implicit meaning in text, as opposed to the early belief some researchers held that sequence-to-sequence models (including transformers) simply fit correlations between sequential words.

Is this actually settled, i.e., does the research community agree on it? Are there papers discussing this topic?
>> Behind this approach is a simple principle often abbreviated as "compression is intelligence", or the model must approximate the distribution of data and perform implicit reasoning in its activations in order to predict the next token (see Solomonoff Induction; Solomonoff 1964)

For the record, the word "intelligence" appears in the two parts of "A Formal Theory of Inductive Inference" (referenced above) a total of 0 times. The word "compression" appears a total of 0 times. The word "reasoning" appears once, in the phrase "using similar reasoning".

Unsurprisingly, Solomonoff's work was preoccupied with inductive inference. I don't know that he ever said anything about "compression is intelligence"; I believe this is an idea, and a slogan, that was developed only much later. I am not sure where it comes from, originally.

It is correct that Solomonoff induction was very much about predicting the next symbol in a sequence of symbols, and not necessarily linguistic tokens, either. The common claim that LLMs are "in their infancy" or similar is dead wrong. Language modelling is basically ancient (in CS terms) and we have long since crossed into the era of its technological maturity.

_______________

[1] https://raysolomonoff.com/publications/1964pt1.pdf

[2] https://raysolomonoff.com/publications/1964pt2.pdf
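To make the slogan's technical content concrete, here is a small sketch of the standard identity it leans on: a predictor's log-loss on a sequence equals the code length (in bits) an ideal arithmetic coder driven by that predictor would need, so better next-symbol prediction is literally better compression. This is my own toy, not anything from Solomonoff's papers or the paper under discussion.

    import math

    def code_length_bits(sequence, predict):
        # Total bits an ideal arithmetic coder would spend on `sequence`
        # when driven by the predictive distribution predict(context, symbol).
        bits, context = 0.0, []
        for symbol in sequence:
            bits += -math.log2(predict(context, symbol))
            context.append(symbol)
        return bits

    def uniform(context, symbol):
        # Knows nothing: every symbol in {a, b} is equally likely.
        return 0.5

    def alternating(context, symbol):
        # Has "understood" the structure: expects a and b to alternate.
        if not context:
            return 0.5
        expected = "b" if context[-1] == "a" else "a"
        return 0.98 if symbol == expected else 0.02

    text = "abababababababab"
    print("uniform predictor:     %.1f bits" % code_length_bits(text, uniform))
    print("alternating predictor: %.1f bits" % code_length_bits(text, alternating))
    # ~16 bits vs ~1.4 bits: the better next-symbol predictor is the better compressor.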
Congrats to the authors on a thoughtful work! I have been thinking about and working on related ideas for a few months now, but I have not yet spent commensurate compute on them and might have gone in a different direction; this work certainly helps create better baselines along the way toward making better use of decoder transformer architectures. Please keep it coming!
I'm a little curious: would anyone have a way to know how often researchers study something they came up with themselves, versus something an independent developer was already doing online that then got picked up, researched, and reported on?
The example in the paper using a plug-and-chug algebra equation, and the step-by-step process to solve it, reinforces the notion that LLMs can only reproduce recipes they have seen before. This is really no different from how we learn mathematics in school: the teacher shows a starting point and moves, step by step, to the end of the process. Calling this "Meta Chain-of-Thought" feels like an aggrandizement of basic educational practice to me. Next we'll be labeling the act of holding basic utensils as Layered Physical Kineticism, or something contrived like that. In school this "Meta Chain of Thought" was called "show your work." Is this really a "phenomenon" that needs explaining? It might teach us more about how we achieve logical induction (steps of reasoning), but we are pretty deep in the soup to be able to accurately describe the shape of the pot.
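For concreteness, here is what that "show your work" recipe looks like when spelled out mechanically. The equation is a hypothetical stand-in; I don't have the paper's exact example in front of me.

    import sympy as sp

    # Hypothetical stand-in for a plug-and-chug example: solve 3x + 5 = 17
    # by the same one-operation-per-line recipe taught in school.
    x = sp.symbols("x")
    lhs, rhs = 3 * x + 5, sp.Integer(17)

    print(sp.Eq(lhs, rhs))           # state: 3x + 5 = 17
    lhs, rhs = lhs - 5, rhs - 5      # subtract 5 from both sides
    print(sp.Eq(lhs, rhs))           # state: 3x = 12
    lhs, rhs = lhs / 3, rhs / 3      # divide both sides by 3
    print(sp.Eq(lhs, rhs))           # state: x = 4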
Meta's recently released Large Concept Models + this Meta Chain of Thought sound very promising for AGI. The timeline of 2030 sounds increasingly plausible IMO.