I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.
What seems a bit miraculous to me is: how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate-sized models, how do you convince management to let you build a huge one?
The reasoning in the article is interesting, but this struck me as a weird example to choose:

> The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = "Write a short story that resonates with the social mood of the present time and is a runaway hit"

Framing a capability as something objectively measurable ("able to perform math at the 12th-grade level", "able to write a coherent, novel text without spelling/grammar mistakes") makes sense within the context of what the author is trying to demonstrate.

But the social-proof aspect ("is a runaway hit") feels orthogonal to it? Things can be runaway hits for social factors independent of the capability they actually represent.
This seems superficial and doesn't really get to the heart of the question. To me it's not so much about bits and parameters as about a more interesting, fundamental question: is pure language itself enough to encompass and encode higher-level thinking?

Empirically we observe that an LLM trained purely to predict the next token can do things like solve complex logic puzzles it has never seen before. Skeptics claim that the network has actually seen at least analogous puzzles and all it is doing is translating between them. However, the novelty of what can be solved is very surprising.

Intuitively it makes sense that at some level intelligence itself becomes a compression algorithm. For example, you could learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that, you can stop trying to store an infinite database of parallel heuristics and instead focus the parameter space on learning common heuristics that apply broadly across the problem space.

The question is: at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is that really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces, you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
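Something like this toy sketch is what I have in mind (random stand-in arrays here; a real check would hook a specific layer of an actual model and use real prompts): high overlap between the most active units across unrelated domains would hint at shared abstractions, low overlap at per-domain heuristics.

    import numpy as np

    def top_units(hidden_states, k=200):
        # hidden_states: (num_prompts, hidden_dim) activations from one layer
        mean_act = np.abs(hidden_states).mean(axis=0)
        return set(np.argsort(mean_act)[-k:])   # indices of the k most active units

    rng = np.random.default_rng(0)
    logic_acts = rng.normal(size=(32, 4096))    # stand-in for logic-puzzle prompts
    poetry_acts = rng.normal(size=(32, 4096))   # stand-in for poetry prompts

    shared = len(top_units(logic_acts) & top_units(poetry_acts)) / 200
    print(f"fraction of top units shared across domains: {shared:.2f}")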
I've always wondered whether the specific dimensionality of the layers and tensors has a particular effect on the model.

It's hard to explain, but higher-dimensional spaces have weird topological properties. Not all of them behave the same way, and some things are perfectly doable in one set of dimensions while in others they just plain don't work (e.g. applying surgery to turn one shape into another).
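Not the surgery-theory kind of weirdness, but here's a quick toy demonstration that dimension alone changes geometric behavior: independent random directions become nearly orthogonal as the dimension grows, which already makes high-dimensional weight spaces behave unlike our low-dimensional intuitions.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 64, 4096):
        x = rng.normal(size=(1000, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)   # random unit vectors
        cos = (x[:500] * x[500:]).sum(axis=1)           # 500 pairwise cosines
        print(f"dim={dim:5d}  mean |cos| = {np.abs(cos).mean():.3f}")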
The bag-of-heuristics thing is interesting to me. Is it not conceivable that an NN of a certain size trained only on math problems would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O across completely different modalities not really possible in this way?
I didn't follow entirely on a fast read, but this confused me especially:

> The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks
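My best guess at the intended meaning, with made-up numbers: the weights can only store so much information, and every task the model handles has to share that pool.

    # Back-of-the-envelope reading of "bit budget" (illustrative numbers only).
    params = 7e9            # e.g. a 7B-parameter model
    bits_per_param = 16     # bf16/fp16 storage; usable capacity is surely lower
    total_bits = params * bits_per_param
    print(f"{total_bits:.2e} raw bits (~{total_bits / 8 / 2**30:.0f} GiB of weights)")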
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, in the sense that there are way more parameters than are needed to fit the data (I understand the training data set is bigger than the model, but the point is that the same loss can be achieved with many different combinations of parameters).

And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
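Here's the idea in its smallest form, using plain least squares as a stand-in (not a neural network): with more parameters than data points, wildly different parameter settings achieve exactly the same training loss.

    import numpy as np

    # 100 parameters, 10 data points: many distinct weight vectors fit exactly.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10, 100)), rng.normal(size=10)

    w_min = np.linalg.pinv(X) @ y            # minimum-norm solution
    v = rng.normal(size=100)
    v -= np.linalg.pinv(X) @ (X @ v)         # project v into the null space of X
    w_alt = w_min + 5.0 * v                  # a very different zero-loss solution

    print(np.allclose(X @ w_min, y), np.allclose(X @ w_alt, y))   # True True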
I often find that people using the word "emergent" to describe properties of a system tend to ascribe quasi-magical properties to the system. Things tend to get vague and hand-wavy when that term comes up.

Just call them properties with unknown provenance.
How could they not?

Emergent properties are unavoidable for any complex system, and probably scale exponentially with complexity or something (I'm sure there's an entire literature about this somewhere).

One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004

"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
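A toy restatement of that argument with made-up numbers: score the same smoothly improving model with a per-token metric and with an all-or-nothing exact-match metric, and only the latter looks "emergent".

    import numpy as np

    # Per-token accuracy improves smoothly with scale, but an all-or-nothing
    # exact-match metric over a 100-token task looks like a sudden jump.
    scales = np.logspace(7, 11, 5)                     # pretend parameter counts
    per_token_acc = 1 - 0.03 * (1e9 / scales) ** 0.25  # smooth power-law improvement
    exact_match = per_token_acc ** 100                 # needs all 100 tokens correct

    for s, p, e in zip(scales, per_token_acc, exact_match):
        print(f"{s:14.0f} params  per-token={p:.3f}  exact-match={e:.4f}")

Per-token accuracy moves by less than a tenth across this range, while exact match goes from essentially zero to roughly 0.4.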
Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to remember how to go from one place to another, and eventually it should be able to find a long path. This can look like an emergent property, since with fewer parameters the LLM would not be able to find the correct path. Now one has to work out what kinds of problems this metaphor is a good model of.
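A toy version of the metaphor (my own made-up setup): if each hop is remembered independently with probability p, a complete 30-hop path almost never appears until p gets close to 1, at which point it shows up "suddenly".

    import random

    def full_path_rate(p, hops=30, trials=5000, seed=0):
        # Chance that all `hops` individually remembered steps line up into one
        # complete path from start to destination.
        rng = random.Random(seed)
        hits = sum(all(rng.random() < p for _ in range(hops)) for _ in range(trials))
        return hits / trials

    for p in (0.80, 0.90, 0.95, 0.99):
        print(f"per-hop recall {p:.2f} -> full-path success {full_path_rate(p):.3f}")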
What do you think about this analogy?

A simple process produces a Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane, or dense point grid, in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) that are generated after pre-training. In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme; in the case of embeddings, we learn the 2D grid of words (playing the role of points) by looking at how the words are used in the corpus.

The following is a quote from Yuri Manin, an eminent mathematician: https://www.youtube.com/watch?v=BNzZt0QHj9U

> Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.

I have a related idea, which I picked up from somewhere, that mirrors the above observation. When we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which the process operates.
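To make the first half of the analogy concrete, here is a minimal ASCII Mandelbrot sketch (nothing to do with any actual model): the iteration rule is one line, and all of the visible structure comes from where each grid point sits, which is the role I'm ascribing to the embeddings.

    import numpy as np

    # One simple rule (z -> z^2 + c) applied over a dense grid of points.
    re = np.linspace(-2.0, 0.6, 80)
    im = np.linspace(-1.2, 1.2, 40)
    c = re[None, :] + 1j * im[:, None]
    z = np.zeros_like(c)
    inside = np.ones(c.shape, dtype=bool)

    for _ in range(50):
        z = np.where(inside, z * z + c, z)    # iterate only points still bounded
        inside &= np.abs(z) <= 2

    for row in inside:
        print("".join("#" if p else " " for p in row))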
It feels like this can be tracked with addition. Humans expect "can do addition" to be a binary skill, because a human either can or cannot add.

LLMs approximate addition. For a long time they would produce hot garbage. Then, after a lot of training, they could sum two-digit numbers correctly.

At that point we'd say "they can do addition", and the property has emerged: they have passed a binary skill threshold.
Isn't "emergent properties" another way to say "we're not very good at understanding the capabilities of complex systems"?
The authors haven't demonstrated emergence in LLMs. If I write a piece of code and it does what I programmed it to do, that's not emergence. LLMs aren't doing anything unexpected yet. I think that's the smell test, because emergence is still subjective.
There are eerie similarities between traces of LLM inference activity and mammalian EEGs. I would be surprised not to see latent and surprisingly complicated characteristics become apparent as contexts and recursive algorithms grow larger.
I'm not a techie, so perhaps someone can help me understand this: AFAIK, no theoretical computer scientist predicted emergence in AI models. Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect? It's like Lord Kelvin saying that heavier-than-air flying machines are impossible a decade before the Wright brothers' first flight.