> we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models<p>Reading anything by major researchers in AI feels like an adversarial battle where they're trying to misuse as much technical scientific and philosophical language as possible, and those of us in adjacent fields are trying to hold the line.<p>In philosophy, and especially the philosophy of science, emergence is a relation between a whole and its parts such that a property of the whole does not obtain just in virtue of the properties of its parts taken in isolation. "Emergence" carries this prior positive, semi-magical, scientific association, which confuses the issue here.<p>No property of the LLM obtains from its parts differently as parameters scale; the mechanism is the same. The performance differs not due to emergence, but due to the "modelling gap" between the statistical structure of free text and that of mathematics. With enough examples, the gap closes... indeed, you can model the addition function (add(x, y) = x + y) just from a large enough sample of its domain.<p>A better technical term here might be "scale-dependent capabilities". For LLMs, simple arithmetic is extremely scale-dependent, whereas basic text generation is less so. The reason for this seems obvious, as given above... so I interpret the use of the term "emergence" here as more PR-ish mystification.
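To make the last point concrete: the claim is only that a function like addition is recoverable from samples of its domain, not that an LLM does it this way. A minimal sketch (plain least squares, made-up sample sizes) showing that a handful of (x, y) pairs pins down add(x, y) = x + y exactly:

```python
# Minimal illustration: the addition function is recoverable from a small
# sample of its domain by ordinary least squares. Sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(-100, 100, size=(50, 2)).astype(float)  # sampled (x, y) pairs
t = X[:, 0] + X[:, 1]                                     # targets: x + y

# Fit t ~ X @ w; the exact solution is w = [1, 1].
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)          # ~[1. 1.]
print(X @ w - t)  # residuals ~0: the "modelling gap" is closed for this function
```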
Wikipedia has a fine definition of what _emergent_ means:<p>> In philosophy, systems theory, science, and art, emergence occurs when an entity is observed to have properties its parts do not have on their own, properties or behaviors that emerge only when the parts interact in a wider whole.<p>The linked article uses this definition:<p>> we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models<p>The concept in the paper has to do with capabilities / abilities that grow non-linearly as a function of model size. This is distinctly different from _emergent behavior_ in systems theory.<p><opinion>The authors and reviewers could find a better word for their concept. There is no need to muddle the concept.</opinion><p>Furthermore, the idea that networks of certain sizes are necessary for certain kinds of representational abilities is not new. Perhaps a term exists already?
Do these scale-dependent (I like this adjective better than "emergent") properties survive model distillation? It may be that our training/optimization processes are inefficient and require these scales to achieve them, but the underlying model may not actually need as many parameters as we are giving it. I haven't read any of the papers on distillation yet; does anyone know if this has been tested?
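For anyone unfamiliar with the setup being asked about: standard knowledge distillation (in the Hinton-style recipe) trains a small student to match a large teacher's softened output distribution. A rough sketch of that loss, with made-up stand-in logits; whether scale-dependent capabilities survive this procedure is exactly the open question above:

```python
# Sketch of the knowledge-distillation loss (soft targets at temperature T).
# The logits below are made-up stand-ins for a large teacher and a small student.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # Cross-entropy of the student against the softened teacher targets,
    # scaled by T^2 as in the usual recipe.
    return -(p_teacher * log_p_student).sum(axis=-1).mean() * T**2

teacher = np.array([[4.0, 1.0, -2.0]])   # confident teacher distribution
student = np.array([[2.5, 0.5, -1.0]])   # smaller student, similar ranking
print(distill_loss(student, teacher))
```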
Have there been any efforts in processing calculation prompts, where instead of letting the model 'compute' internally, it's trained to identify equations and hand them to an external calculator instead (perhaps one that outputs not only the result but the individual steps too)?
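A toy sketch of the post-processing half of that idea, assuming (hypothetically) the model has been trained to wrap arithmetic in a `<<...>>` marker; the marker convention and helper names here are made up, not from any particular system:

```python
# Toy post-processor: find arithmetic the model has marked up (assumed
# convention: <<expression>>) and replace it with an externally computed
# result, so the model never has to "compute" internally.
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.USub: operator.neg}

def safe_eval(expr: str):
    """Evaluate +, -, *, / arithmetic only, via the AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr}")
    return walk(ast.parse(expr, mode="eval"))

def fill_calculations(model_output: str) -> str:
    return re.sub(r"<<(.+?)>>", lambda m: str(safe_eval(m.group(1))), model_output)

print(fill_calculations("The total is <<127 * 48>> apples."))
# -> "The total is 6096 apples."
```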
There's a quite accessible IAS presentation[1] from another Google researcher on <i>Solving Quantitative Reasoning Problems with Language Models</i>, which gives some likely related background on having language models solve this type of math problem, including the "chain of thought" technique mentioned here.<p>I found it pretty interesting, and as something of an ML skeptic I was a bit surprised at the degree of coherence shown in "reasoning" examples similar to the ones in the linked article.<p>1: <a href="https://www.youtube.com/watch?v=qV4Ku5L4BuMt">https://www.youtube.com/watch?v=qV4Ku5L4BuMt</a>
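For anyone who hasn't seen it, "chain of thought" is essentially a prompting technique: the few-shot exemplars include worked-out intermediate steps, which nudges the model to generate its own steps before the final answer. A hedged sketch below; the exemplar text is adapted from the sort of examples used in the chain-of-thought paper, and no particular model API is assumed:

```python
# Sketch of a few-shot chain-of-thought prompt vs. a direct-answer prompt.
# Feed the resulting string to whatever language-model API you use.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)

DIRECT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: The answer is 11.\n\n"
)

def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    exemplar = COT_EXEMPLAR if chain_of_thought else DIRECT_EXEMPLAR
    return exemplar + f"Q: {question}\nA:"

print(build_prompt("A cafeteria had 23 apples. It used 20 and bought 6 more. "
                   "How many apples does it have?"))
```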
The x-axis here is training FLOPs, but what about parameter size, and how does it account for the different architectures? Comparing apples to shoelaces may not be a fruitful approach, or indicative of what to expect from ever-expanding scale. Also, is it emergence or overfitting?
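Training FLOPs and parameter count are tied together fairly directly: a common back-of-the-envelope from the scaling-law literature is C ≈ 6·N·D (about 6 FLOPs per parameter per training token), so the x-axis is roughly a proxy for both. A quick sketch, with model/token counts only approximately matching GPT-3- and PaLM-scale runs:

```python
# Back-of-the-envelope: training compute C ~ 6 * N * D FLOPs,
# where N = parameters and D = training tokens (a standard approximation
# from the scaling-law literature). Example sizes are approximate.
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

for n_params, n_tokens in [(1e9, 300e9), (175e9, 300e9), (540e9, 780e9)]:
    print(f"{n_params:9.2e} params, {n_tokens:9.2e} tokens "
          f"-> ~{train_flops(n_params, n_tokens):.2e} FLOPs")
```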