An LLM necessarily has to create <i>some</i> sort of internal "model" / representations pursuant to its "predict next word" training goal, given the depth and sophistication of context recognition needed to do well. It isn't an N-gram model restricted to just looking at surface word sequences.<p>However, the question should be what <i>sort</i> of internal "model" it has built. It seems fashionable to refer to this as a "world model", but IMO this isn't really appropriate, and it's certainly going to be quite different to the predictive representations that any animal that <i>interacts</i> with the world, and learns from those interactions, will have built.<p>The thing is that an LLM is an auto-regressive model - it is trying to predict continuations of training set samples based solely on word sequences, and is not privy to the world actually being described by those word sequences. It can't model the generative process of the humans who created those training samples, because <i>that</i> generative process has different inputs - sensory ones (in addition to auto-regressive ones).<p>The "world model" of a human, or any other animal, is built pursuant to predicting the environment, but not in a purely passive way (such as a multi-modal LLM predicting the next frame of a video). The animal is primarily concerned with predicting the outcomes of its own <i>interactions</i> with the environment, driven by the evolutionary pressure to learn to act in a way that maximizes survival and proliferation of its DNA. This is the nature of a real "world model" - it models the world (as perceived through sensory inputs) as a dynamical process reacting to the animal's actions. That is very different to the passive "context patterns" learnt by an LLM, which merely predict auto-regressive continuations (whether of words, or of multi-modal video frames, etc).
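<p>To make the contrast concrete, here's a rough toy sketch of the two training objectives side by side - next-token prediction over a passive word sequence vs. action-conditioned prediction of how the environment responds. It's just illustrative (PyTorch, made-up dimensions, random tensors standing in for real data), not anyone's actual training code:

  import torch
  import torch.nn as nn

  # Objective 1: auto-regressive next-token prediction (LLM-style).
  # Input and target are shifted views of the same passive word sequence.
  vocab, dim = 100, 32
  emb = nn.Embedding(vocab, dim)
  rnn = nn.GRU(dim, dim, batch_first=True)   # toy stand-in for a transformer
  head = nn.Linear(dim, vocab)

  tokens = torch.randint(0, vocab, (1, 16))  # a stand-in "sentence"
  hidden, _ = rnn(emb(tokens[:, :-1]))       # predict each next token from its prefix
  lm_loss = nn.functional.cross_entropy(
      head(hidden).reshape(-1, vocab), tokens[:, 1:].reshape(-1))

  # Objective 2: action-conditioned prediction (world-model-style).
  # The target is the consequence of the agent's own action, not a
  # continuation of a passively observed stream.
  state_dim, action_dim = 8, 2
  world_model = nn.Sequential(
      nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))

  state = torch.randn(1, state_dim)       # current sensory observation
  action = torch.randn(1, action_dim)     # what the agent chose to do
  next_state = torch.randn(1, state_dim)  # what the world did in response
  wm_loss = nn.functional.mse_loss(
      world_model(torch.cat([state, action], dim=-1)), next_state)

<p>The second objective only makes sense for a learner that can act: the training signal is the world's reaction to its own choices, which is exactly the input an LLM never gets.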