I remain skeptical of emergent properties in LLMs in the way that people have used that term. There was a belief 3-4 years ago that if you just make the models big enough, they magically acquire intelligence. But since then, we’ve seen that the models are actually still pretty limited by the training data: like other ML models, they interpolate well between the data they’ve been trained on, but they don’t generalize well beyond it. Also, we have seen models that are 50-100x smaller now exhibit the same “emergent” capabilities that were once thought to require hundreds of billions of parameters. I personally think the emergent properties really belong to the data instead.
What seems a bit miraculous to me is: how did the researchers who put us on this path come to suspect that you could just throw more data and more parameters at the problem? If the emergent behavior doesn't appear for moderate-sized models, how do you convince management to let you build a huge one?
The reasoning in the article is interesting, but this struck me as a weird example to choose:

> The real question is how can we predict when a new LLM will achieve some new capability X. For example, X = "Write a short story that resonates with the social mood of the present time and is a runaway hit"

Framing a capability as something objectively measurable ("able to perform math at the 12th-grade level", "able to write a coherent, novel text without spelling/grammar mistakes") makes sense within the context of what the author is trying to demonstrate.

But the social-proof aspect ("is a runaway hit") feels orthogonal to it? Things can be runaway hits for social factors independent of the capability they actually represent.
This seems superficial and doesn't really get to the heart of the question. To me it's not so much about bits and parameters as about a more interesting, fundamental question: is pure language itself enough to encompass and encode higher-level thinking?

Empirically we observe that an LLM trained purely to predict the next token can do things like solve complex logic puzzles it has never seen before. Skeptics claim that the network has actually seen at least analogous puzzles and all it is doing is translating between them. However, the novelty of what can be solved is very surprising.

Intuitively it makes sense that at some level intelligence itself becomes a compression algorithm. For example, you could learn separately how to solve every puzzle ever presented to mankind, but that would take a lot of space. At some point it's more efficient to just learn "intelligence" itself and then apply that to the problem of predicting the next token. Once you do that, you can stop trying to store an infinite database of parallel heuristics and instead focus the parameter space on learning common heuristics that apply broadly across the problem space.

The question is: at what parameter count and volume of training data does the situation flip to favoring "learning intelligence" rather than storing redundant domain-specialised heuristics? And is that really happening? I would have thought just looking at the activation patterns could tell you a lot, because if common activations happen for entirely different problem spaces, you can argue that the network has to be learning common abstractions. If not, maybe it's just doing really large-scale redundant storage of heuristics.
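Something like this toy sketch is what I have in mind (random stand-in arrays here; a real check would hook a specific layer of an actual model and use real prompts): high overlap between the most active units across unrelated domains would hint at shared abstractions, low overlap at per-domain heuristics.

    import numpy as np

    def top_units(hidden_states, k=200):
        # hidden_states: (num_prompts, hidden_dim) activations from one layer
        mean_act = np.abs(hidden_states).mean(axis=0)
        return set(np.argsort(mean_act)[-k:])   # indices of the k most active units

    rng = np.random.default_rng(0)
    logic_acts = rng.normal(size=(32, 4096))    # stand-in for logic-puzzle prompts
    poetry_acts = rng.normal(size=(32, 4096))   # stand-in for poetry prompts

    shared = len(top_units(logic_acts) & top_units(poetry_acts)) / 200
    print(f"fraction of top units shared across domains: {shared:.2f}")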
I've always wondered whether the specific dimensionality of the layers and tensors has a particular effect on the model.

It's hard to explain, but higher-dimensional spaces have weird topological properties. Not all of them behave the same way, and some things are perfectly doable in one set of dimensions while in others they just plain don't work (e.g. applying surgery to turn one shape into another).
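Not the surgery-theory kind of weirdness, but here's a quick toy demonstration that dimension alone changes geometric behavior: independent random directions become nearly orthogonal as the dimension grows, which already makes high-dimensional weight spaces behave unlike our low-dimensional intuitions.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 64, 4096):
        x = rng.normal(size=(1000, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)   # random unit vectors
        cos = (x[:500] * x[500:]).sum(axis=1)           # 500 pairwise cosines
        print(f"dim={dim:5d}  mean |cos| = {np.abs(cos).mean():.3f}")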
The bag-of-heuristics thing is interesting to me. Is it not conceivable that an NN of a certain size trained only on math problems would be able to wire up what amounts to a calculator? And if so, could that form part of a wider network, or is I/O across completely different modalities not really possible in this way?
I didn't follow entirely on a fast read, but this confused me especially:

> The parameter count of an LLM defines a certain bit budget. This bit budget must be spread across many, many tasks
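My best guess at the intended meaning, with made-up numbers: the weights can only store so much information, and every task the model handles has to share that pool.

    # Back-of-the-envelope reading of "bit budget" (illustrative numbers only).
    params = 7e9            # e.g. a 7B-parameter model
    bits_per_param = 16     # bf16/fp16 storage; usable capacity is surely lower
    total_bits = params * bits_per_param
    print(f"{total_bits:.2e} raw bits (~{total_bits / 8 / 2**30:.0f} GiB of weights)")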
I'm pretty sure that LLMs, like all big neural networks, are massively under-specified, in the sense that there are way more parameters than are needed to fit the data (I understand the training data set is bigger than the model, but the point is that the same loss can be achieved with many different combinations of parameters).

And I think of this under-specification as the reason neural networks extrapolate cleanly and thus generalize.
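Here's the idea in its smallest form, using plain least squares as a stand-in (not a neural network): with more parameters than data points, wildly different parameter settings achieve exactly the same training loss.

    import numpy as np

    # 100 parameters, 10 data points: many distinct weight vectors fit exactly.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(10, 100)), rng.normal(size=10)

    w_min = np.linalg.pinv(X) @ y            # minimum-norm solution
    v = rng.normal(size=100)
    v -= np.linalg.pinv(X) @ (X @ v)         # project v into the null space of X
    w_alt = w_min + 5.0 * v                  # a very different zero-loss solution

    print(np.allclose(X @ w_min, y), np.allclose(X @ w_alt, y))   # True True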
I often find that people using the word "emergent" to describe properties of a system tend to ascribe quasi-magical properties to the system. Things tend to get vague and hand-wavy when that term comes up.

Just call them properties with unknown provenance.
How could they not?

Emergent properties are unavoidable for any complex system, and probably scale exponentially with complexity or something (I'm sure there's an entire literature about this somewhere).

One good instance is spandrels in evolutionary biology. The Wikipedia article is a good explanation of the subject: https://en.m.wikipedia.org/wiki/Spandrel_(biology)
Alternate view: Are Emergent Abilities of Large Language Models a Mirage? https://arxiv.org/abs/2304.15004

"Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale. Specifically, nonlinear or discontinuous metrics produce apparent emergent abilities, whereas linear or continuous metrics produce smooth, continuous predictable changes in model performance."
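A toy restatement of that argument with made-up numbers: score the same smoothly improving model with a per-token metric and with an all-or-nothing exact-match metric, and only the latter looks "emergent".

    import numpy as np

    # Per-token accuracy improves smoothly with scale, but an all-or-nothing
    # exact-match metric over a 100-token task looks like a sudden jump.
    scales = np.logspace(7, 11, 5)                     # pretend parameter counts
    per_token_acc = 1 - 0.03 * (1e9 / scales) ** 0.25  # smooth power-law improvement
    exact_match = per_token_acc ** 100                 # needs all 100 tokens correct

    for s, p, e in zip(scales, per_token_acc, exact_match):
        print(f"{s:14.0f} params  per-token={p:.3f}  exact-match={e:.4f}")

Per-token accuracy moves by less than a tenth across this range, while exact match goes from essentially zero to roughly 0.4.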
Metaphor: finding a path from an initial point to a destination in a graph. As the number of parameters increases, one can expect the LLM to remember how to go from one place to another, and eventually it should be able to find a long path. This can look like an emergent property, since with fewer parameters the LLM would not be able to find the correct path. Now one has to work out what kinds of problems this metaphor is a good model of.
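A toy version of the metaphor (my own made-up setup): if each hop is remembered independently with probability p, a complete 30-hop path almost never appears until p gets close to 1, at which point it shows up "suddenly".

    import random

    def full_path_rate(p, hops=30, trials=5000, seed=0):
        # Chance that all `hops` individually remembered steps line up into one
        # complete path from start to destination.
        rng = random.Random(seed)
        hits = sum(all(rng.random() < p for _ in range(hops)) for _ in range(trials))
        return hits / trials

    for p in (0.80, 0.90, 0.95, 0.99):
        print(f"per-hop recall {p:.2f} -> full-path success {full_path_rate(p):.3f}")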
What do you think about this analogy?

A simple process produces a Mandelbrot set. A simple process (loss minimization through gradient descent) produces LLMs. So what plays the role of the 2D plane, or dense point grid, in the case of LLMs? It is the embeddings (or ordered combinations of embeddings) that are generated after pre-training. In the case of a 2D plane, the closeness between two points is determined by our numerical representation scheme; in the case of embeddings, we learn the 2D grid of words (playing the role of points) by looking at how the words are used in the corpus.

The following is a quote from Yuri Manin, an eminent mathematician: https://www.youtube.com/watch?v=BNzZt0QHj9U

> Of the properties of mathematics, as a language, the most peculiar one is that by playing formal games with an input mathematical text, one can get an output text which seemingly carries new knowledge. The basic examples are furnished by scientific or technological calculations: general laws plus initial conditions produce predictions, often only after time-consuming and computer-aided work. One can say that the input contains an implicit knowledge which is thereby made explicit.

I have a related idea, which I picked up from somewhere, that mirrors the above observation. When we see beautiful fractals generated by simple equations and iterative processes, we give importance only to the equations, not to the Cartesian grid on which the process operates.
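To make the first half of the analogy concrete, here is a minimal ASCII Mandelbrot sketch (nothing to do with any actual model): the iteration rule is one line, and all of the visible structure comes from where each grid point sits, which is the role I'm ascribing to the embeddings.

    import numpy as np

    # One simple rule (z -> z^2 + c) applied over a dense grid of points.
    re = np.linspace(-2.0, 0.6, 80)
    im = np.linspace(-1.2, 1.2, 40)
    c = re[None, :] + 1j * im[:, None]
    z = np.zeros_like(c)
    inside = np.ones(c.shape, dtype=bool)

    for _ in range(50):
        z = np.where(inside, z * z + c, z)    # iterate only points still bounded
        inside &= np.abs(z) <= 2

    for row in inside:
        print("".join("#" if p else " " for p in row))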
It feels like this can be tracked with addition. Humans expect "can do addition" to be a binary skill, because a human either can or cannot add.

LLMs approximate addition. For a long time they would produce hot garbage. Then, after a lot of training, they could sum two-digit numbers correctly.

At that point we'd say "they can do addition", and the property has emerged: they have passed a binary skill threshold.
Isn't "emergent properties" another way to say "we're not very good at understanding the capabilities of complex systems"?
The authors haven't demonstrated emergence in LLMs. If I write a piece of code and it does what I programmed it to do, that's not emergence. LLMs aren't doing anything unexpected yet. I think that's the smell test, because emergence is still subjective.
There are eerie similarities between traces of LLM inference activity and mammalian EEGs. I would be surprised not to see latent and surprisingly complicated characteristics become apparent as contexts and recursive algorithms grow larger.
I'm not a techie, so perhaps someone can help me understand this: AFAIK, no theoretical computer scientist predicted emergence in AI models. Doesn't that suggest that the field of theoretical computer science (or theoretical AI, if you will) is suspect? It's like Lord Kelvin saying that heavier-than-air flying machines are impossible a decade before the Wright brothers' first flight.