I think we're all trying to grok what LLMs like ChatGPT are really doing, and I think the answer really is that they have developed a "world model", for want of a better term.

Of course an LLM is by design only trained to predict the next word (based on the language statistics it has learned), so it's tempting to say that it's just using "surface statistics", but that's a bit too dismissive and ignores the emergent capabilities which indicate there's rather more going on...

The thing is, to be a REALLY good "language model" you need to go well beyond grammar or short-range statistical predictions. To be a REALLY good language model you need to learn (and evidently, if based on a large enough transformer, *can* learn) about abstract contexts, so that you can maintain context while generating something likely/appropriate given all the nuances of the prompt (whether that's requesting a haiku, or python code, or a continuation of a fairy tale, etc).

I guess one way to regard this is that it has learned statistical patterns at many different levels of hierarchy, specific to many different contexts (fairy tale vs python code, etc). But of course this means that it's representing and maintaining these deep, hierarchical, long-range contexts while generating word-by-word output, so it seems inappropriate to just call these "surface statistics", and more descriptive to refer to them as the "world model" it has learned.

One indication of the level of abstraction of this world model comes from a recent paper which showed that the model represents in its internal activations whether its input is true or not (and correctly treats the negation of a true statement as false). That can only reflect that truth is a concept/context it had to learn in order to predict well in some circumstances. For example, if generating a news story then it's going to be best to maintain a truthy context, but for a fairy tale not so much!

I think how we describe, and understand, these very capable LLMs needs to go beyond their mechanics and training goals, and reflect what we can deduce they've learned and what they are capable of. If the model is (literally) representing concepts as abstract as truth, then that seems to go far beyond what might reasonably be called "surface statistics". While I think these architectures need to be elaborated to add key capabilities needed for AGI, it's perhaps also worth noting that the other impressive "predictive intelligence", our brain, a wetware machine, could also be regarded as generating behavior based only on learned statistics; but at some point deep, hierarchical, context-dependent statistics are best called something else - a world model.
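
To make the probing idea concrete, here's a minimal sketch (purely illustrative, not the paper's actual method) of how one can test whether a true/false distinction is linearly readable from a model's activations. It assumes the HuggingFace transformers library with GPT-2 as a stand-in model, plus scikit-learn, and the handful of labelled statements is made up for the example:

    # Minimal sketch of a linear "truth probe" over hidden activations.
    # Assumptions: HuggingFace transformers + scikit-learn installed,
    # GPT-2 as a stand-in model, toy labelled statements for illustration.
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModel.from_pretrained("gpt2")

    statements = [
        ("Paris is the capital of France.", 1),
        ("Paris is not the capital of France.", 0),
        ("Water boils at 100 degrees Celsius at sea level.", 1),
        ("Water does not boil at 100 degrees Celsius at sea level.", 0),
    ]

    def last_token_activation(text, layer=-1):
        # Run the model and grab the hidden state of the final token
        # at the chosen layer - a common choice when probing.
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer][0, -1].numpy()

    X = [last_token_activation(s) for s, _ in statements]
    y = [label for _, label in statements]

    # Fit a linear classifier on the activations. If a probe like this
    # separates true from false statements (on held-out data, with a
    # realistically sized dataset), the truth/falsehood distinction is
    # linearly represented in the model's internal state.
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print(probe.score(X, y))

If a simple linear probe like this generalizes to held-out statements, that's the sense in which "truth" is represented in the activations themselves rather than just in the surface text.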