Maybe it's clear to others, but it's certainly not clear to me how exactly transformers - or rather transformer-based LLMs - are operating.<p>I understand how transformers work, but my mental model is that the transformer is the processor and the LLM is an application running on it. After all, transformers can be trained to do lots of things, and what one learns when trained with a "predict the next word" LLM objective is going to differ from what it learns (and hence how it operates) in a different setting.<p>There have been various interpretability papers analyzing aspects of LLMs, such as the discovery of pairs of consecutive-layer attention heads acting as "search and copy" "induction heads", and the analysis of the linear (MLP) layers as key-value stores. That perhaps suggests another weak abstraction: the linear layers store knowledge and maybe the reasoning "program", while the attention layers are the mechanism being programmed to do the data tagging and shuffling.<p>No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads beyond induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure a crude transformer-level model really works given that the residual stream is additive - it's not just tokens being <i>moved</i> around.
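For what it's worth, here is a toy sketch of the induction-head "search and copy" behaviour in plain Python - my own illustration of the pattern those papers describe, not how a real head actually computes it:

    def induction_predict(tokens):
        """Predict the next token by finding the most recent earlier occurrence
        of the current token and copying whatever followed it."""
        last = tokens[-1]
        for i in range(len(tokens) - 2, -1, -1):   # "search" backwards over the prefix
            if tokens[i] == last:
                return tokens[i + 1]               # "copy" the token that followed the match
        return None                                # no prior occurrence, nothing to copy

    print(induction_predict(list("abcab")))        # -> 'c', because "ab" was followed by "c" earlier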
This has been built on extensively over the past two years. For instance: Tighter Bounds on the Expressivity of Transformer Encoders <a href="https://arxiv.org/abs/2301.10743" rel="nofollow noreferrer">https://arxiv.org/abs/2301.10743</a>. I find it interesting that transformer encoders turn out to correspond, roughly, to first-order logic with counting quantifiers - a logic that sits inside small circuit classes. Amazing what you can do even if you're not Turing complete!
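To make the "counting" part concrete, here's a toy of my own (not the paper's construction): a head that attends uniformly over all positions computes an average, i.e. the fraction of positions carrying some feature, and thresholding that fraction decides counting-style predicates such as MAJORITY:

    import numpy as np

    def uniform_attention(values):
        n = len(values)
        weights = np.full(n, 1.0 / n)     # uniform attention pattern over all positions
        return weights @ values           # average of the values = fraction of 1s

    bits = np.array([1, 0, 1, 1, 0, 1], dtype=float)   # per-token indicator feature
    print(uniform_attention(bits) > 0.5)               # True: a MAJORITY of tokens carry the feature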
To quote someone: RASP is like Matlab, designed by Satan.<p>There's an interpreter for a RASP-like language if you want to try it out: <a href="https://srush.github.io/raspy/" rel="nofollow noreferrer">https://srush.github.io/raspy/</a><p>And DeepMind published Tracr, a compiler from RASP programs to transformer weights: <a href="https://github.com/deepmind/tracr">https://github.com/deepmind/tracr</a>
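If you don't want to install anything, the two core RASP primitives are easy to mock up in numpy (the real RASPy API differs a bit; this is just to convey the idea): select builds a boolean attention pattern from a predicate, and aggregate averages values under it. The classic exercise is reversing a sequence:

    import numpy as np

    def select(keys, queries, predicate):
        n = len(keys)
        return np.array([[predicate(keys[j], queries[i]) for j in range(n)]
                         for i in range(n)], dtype=float)

    def aggregate(selector, values):
        weights = selector / selector.sum(axis=1, keepdims=True)   # normalize each row
        return weights @ np.asarray(values, dtype=float)

    tokens = np.array([10, 20, 30, 40], dtype=float)
    idx = np.arange(len(tokens))
    sel = select(keys=idx, queries=len(tokens) - 1 - idx,
                 predicate=lambda k, q: k == q)        # position i attends to position n-1-i
    print(aggregate(sel, tokens))                      # [40. 30. 20. 10.]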
This is cool, but I think a more fundamental primitive is the probability distribution over next tokens and how it changes with each layer's computation.
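That's roughly the "logit lens" trick: project the residual stream after every layer through the model's unembedding and watch the next-token distribution sharpen. A quick sketch with GPT-2 and Hugging Face transformers (the standard logit-lens approximation, not an exact account of what each layer "believes"):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)

    # out.hidden_states: the embedding output plus the residual stream after each block
    for layer, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))   # project residual to vocab
        top = logits.softmax(-1).topk(1)
        print(layer, repr(tokenizer.decode(top.indices)), round(float(top.values), 3))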