So, I'm just a layman when it comes to AI/ML, but I do understand computability: what it's possible to compute with a given machine, and how we can build higher-computational-power primitives out of lower-computational-power ones by plugging them together with "glue" like parallel feed-forward chains (e.g. the carry bits of an ALU adder) and loops over static sub-states of execution.

My own mental model for what Transformers *must necessarily* be doing, in order to compute what they compute, given:

1. the primitives they're made of (for a Transformer: matmul with a learned matrix; vector-add of a learned bias vector; normalization; softmax);

2. what those primitives can compute within a single layer; and

3. the low-ish total number of layers in a Transformer model

...is that they were already effectively "state space models" in practice. So this doesn't really surprise me!

(To be explicit, my assertion is this: for a given latent space between layers N and N+1 in a Transformer, that latent space encodes a set of state variables [think CPU registers] used by the Nth serial computation step of an arbitrary set of learned algorithms. These algorithms are limited to those whose every computation step can be encoded as a fused matmul-plus-vector-add, so that each algorithm can be learned as a depthwise-extruded sequence of weights across the layers. The learned algorithms can and do share state variables, both as inputs and as outputs. And each state variable is attenuated by an activation probability [in a Transformer: attention], so that an algorithm's output forms a pre-multiplied *conditional probability* of the output given the confidence of the inputs. In turn, the same state variable can be a low-confidence output of one algorithm and a high-confidence output of another, and the high-confidence component will swamp the low-confidence one.)
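To make that concrete, here's a toy NumPy sketch of the mental model only (not of any real Transformer implementation): the state vector, the per-layer (W, b) pairs, the scalar `gate` standing in for attention, and the two competing "algorithms" are all illustrative inventions of mine, chosen just to show the fused matmul-plus-vadd update, the depthwise-extruded weights, and the confidence-weighted swamping.

    import numpy as np

    rng = np.random.default_rng(0)

    D = 16      # width of the latent "register file" of shared state variables
    LAYERS = 4  # depth = number of serial computation steps available

    def layer_step(state, W, b, gate):
        """One depthwise step of a learned 'algorithm': a fused
        matmul-plus-bias update, attenuated by an activation
        probability `gate` (standing in for attention), written
        back into the shared state vector."""
        update = state @ W + b          # fused matmul + vector-add
        return state + gate * update    # confidence-weighted write-back

    state = rng.normal(size=D)

    # A depthwise-extruded "algorithm": one (W, b) pair per layer,
    # applied in sequence to the same state vector.
    weights = [(rng.normal(size=(D, D)) * 0.1, rng.normal(size=D) * 0.1)
               for _ in range(LAYERS)]
    gates = [0.9, 0.2, 0.7, 1.0]  # per-step activation probabilities

    s = state.copy()
    for (W, b), g in zip(weights, gates):
        s = layer_step(s, W, b, g)

    # Two "algorithms" writing the same state variables with different
    # confidence: the high-confidence contribution dominates the sum.
    W_a, b_a = rng.normal(size=(D, D)) * 0.1, rng.normal(size=D) * 0.1
    W_b, b_b = rng.normal(size=(D, D)) * 0.1, rng.normal(size=D) * 0.1
    low_conf, high_conf = 0.05, 0.95

    out = s + low_conf * (s @ W_a + b_a) + high_conf * (s @ W_b + b_b)

    print(np.linalg.norm(low_conf  * (s @ W_a + b_a)))   # small contribution
    print(np.linalg.norm(high_conf * (s @ W_b + b_b)))   # swamps the one above

The point of the last two lines is just the "swamping" claim: because each write is pre-multiplied by its confidence before the contributions are summed, the low-confidence algorithm's view of a shared state variable barely moves it, while the high-confidence one effectively sets it.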