Very cool. I recommend scrolling down to look at the example problem that O3 still can't solve. It's clear what goes on in the human brain when solving this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn't quite work, so we zoom in on an example we got wrong and refine the hypothesis so that it covers that example too. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, it's how humans do science: iteratively formulating, rejecting, and refining hypotheses against collected data.

From this it makes sense why the original models did poorly and why iterative chain of thought is required: the challenge is designed to be inherently iterative, such that a zero-shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about which hypotheses are "simple", based on things like object permanence, directionality, and cardinality. But as the author says, these basic world models were already encoded in the GPT-3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like the following (sketched in code after the list):

1. Prompt the model to produce a simple rule that explains the nth example (randomly chosen).

2. Choose a different example and ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to *revise* the hypothesis in the simplest possible way that also explains this example.

3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already-solved examples. That's fine, just keep iterating.

4. Inject randomness into the process (through next-word sampling noise, example ordering, etc.) and run it a large number of times, yielding, say, 1,000 hypotheses that each explain all the examples. Due to path dependency, anchoring, and consistency effects, some of these paths will end in awful hypotheses: super convoluted ones involving a large number of arbitrary rules. But some will be simple.

5. Ask the model to select among the valid hypotheses (those that satisfy all examples), choosing the one it views as the simplest for a human to discover.
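To make the guess concrete, here is a minimal Python sketch of steps 1-5. Everything in it is speculative: `llm` stands in for a generic text-completion call, and every function name and prompt is illustrative, not anything O3 actually exposes.

```python
import random

# Hypothetical interface: llm(prompt: str) -> str stands in for one call
# to the underlying model. Nothing here is a real OpenAI API.

def propose_rule(llm, example):
    """Step 1: ask the model for a simple rule explaining one example."""
    return llm(f"Propose the simplest rule mapping this input to this output:\n{example}")

def rule_explains(llm, rule, example):
    """Check whether a candidate rule accounts for a given example."""
    answer = llm(f"Does the rule '{rule}' correctly explain this example?\n"
                 f"{example}\nAnswer yes or no.")
    return answer.strip().lower().startswith("yes")

def revise_rule(llm, rule, counterexample):
    """Step 2: minimally revise the rule to cover a failing example."""
    return llm(f"The rule '{rule}' fails on:\n{counterexample}\n"
               "Revise it in the simplest way that also explains this example.")

def search_hypothesis(llm, examples, max_iters=50):
    """Steps 1-3: iterate until one rule explains every example, or give up."""
    order = examples[:]
    random.shuffle(order)            # step 4: randomize example order per run
    rule = propose_rule(llm, order[0])
    for _ in range(max_iters):
        failures = [ex for ex in order if not rule_explains(llm, rule, ex)]
        if not failures:             # rule now covers all examples
            return rule
        # Revisions may break previously solved examples; the next pass
        # re-checks everything, so that's fine.
        rule = revise_rule(llm, rule, failures[0])
    return None                      # no consistent rule found within budget

def solve(llm, examples, n_runs=1000):
    """Steps 4-5: run many randomized searches, then pick the simplest survivor."""
    runs = (search_hypothesis(llm, examples) for _ in range(n_runs))
    valid = [rule for rule in runs if rule is not None]
    if not valid:
        return None
    return llm("All of these rules explain every example. Pick the one that "
               "is simplest for a human to discover:\n" + "\n".join(valid))
```

The shuffle plus sampling temperature is what gives the 1,000 runs their diversity; the final `solve` call is the step-5 selection, where simplicity is judged by the model itself rather than by any formal complexity measure.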