>> Only 2 of the 9 LLMs solved the "list all ways" prompt, but 7 out of 9 solved the "write a program" prompt. The language that a problem-solver uses matters! Sometimes a natural language such as English is a good choice, sometimes you need the language of mathematical equations, or chemical equations, or musical notation, and sometimes a programming language is best. Written language is an amazing invention that has enabled human culture to build over the centuries (and also enabled LLMs to work). But human ingenuity has devised other notations that are more specialized but very effective in limited domains.

If I understand correctly, Peter Norvig's argument is about the relative expressivity and precision of Python and natural language with respect to a particular kind of problem. He's saying that Python is a more appropriate language than natural language for expressing factorisation problems and their solutions.

Respectfully (very respectfully), I disagree. The much simpler explanation is that the training sets of most LLMs contain many more examples of factorisation problems and their solutions in Python (and other programming languages) than in natural language. Examples in Python etc. are also likely to share more common structure, even down to function and variable names [1], so there are more statistical regularities for a language model to overfit to during training.

We know LLMs do this. We even know how they do it, to an extent. We've known since the time of BERT. For example:

"Probing Neural Network Comprehension of Natural Language Arguments"
https://aclanthology.org/P19-1459/

"Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference"
https://aclanthology.org/P19-1334/

Given these and other prior results, Peter Norvig's single experiment is neither sufficient nor strong enough evidence to support his alternative hypothesis. Ideally, we would test an LLM by asking it to solve a factorisation problem in a language for which we can ensure the training data contains very few example solutions, but that is unfortunately very hard to do.

______________

[1] Notice, for instance, how Llama 3.1 immediately names the problem "find_factors", even though no such instruction appears in either prompt. That's because it has seen that kind of code in the context of that kind of question during training. The other LLMs seem to take their names from the prompts instead.
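
To make [1] concrete, the stereotyped shape I have in mind looks roughly like the sketch below. This is my own minimal Python, not the output of any of the nine models; the name find_factors merely echoes the one Llama 3.1 chose.

    # Hypothetical sketch of the kind of solution that recurs in training data;
    # not reproduced from any model's actual output.
    def find_factors(n):
        """Return all pairs (a, b) with a <= b and a * b == n."""
        pairs = []
        for a in range(1, int(n ** 0.5) + 1):
            if n % a == 0:
                pairs.append((a, n // a))
        return pairs

    print(find_factors(36))  # [(1, 36), (2, 18), (3, 12), (4, 9), (6, 6)]

Once code converges on a template like this, a model only has to reproduce the template and fill in the blanks, which is exactly the kind of statistical regularity that is easy to overfit to.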