Maybe it's clear to others, but it's certainly not clear to me how exactly transformers - or rather transformer-based LLMs - are operating.<p>I understand how transformers work, but my mental model is that the transformer is the processor and the LLM is an application running on it. After all, transformers can be trained to do lots of things, and what one learns when trained with a "predict the next word" LLM objective is going to differ from what it learns (and hence how it operates) in a different setting.<p>There have been various interpretability papers analyzing aspects of LLMs, such as the discovery of pairs of consecutive-layer attention heads acting as "search and copy" "induction heads", and the analysis of the linear (MLP) layers as key-value stores. That perhaps suggests another weak abstraction: the linear layers store knowledge and maybe the reasoning "program", while the attention layers are the mechanism being programmed to do the data tagging and shuffling.<p>No doubt there's a lot more to be discovered about how these LLMs are operating - perhaps a wider variety of primitives built out of attention heads beyond induction heads? It seems a bit early to be building a high-level model of the primitives these LLMs have learnt, and I'm not sure a crude transformer-level model really works given that the residual stream is additive - it's not just tokens being <i>moved</i> around.
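For what it's worth, here is a toy sketch of the induction-head "search and copy" behaviour in plain Python - my own illustration of the pattern those papers describe, not how a real head actually computes it:

    def induction_predict(tokens):
        """Predict the next token by finding the most recent earlier occurrence
        of the current token and copying whatever followed it."""
        last = tokens[-1]
        for i in range(len(tokens) - 2, -1, -1):   # "search" backwards over the prefix
            if tokens[i] == last:
                return tokens[i + 1]               # "copy" the token that followed the match
        return None                                # no prior occurrence, nothing to copy

    print(induction_predict(list("abcab")))        # -> 'c', because "ab" was followed by "c" earlier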
This has been built on extensively over the past two years. For instance: Tighter Bounds on the Expressivity of Transformer Encoders <a href="https://arxiv.org/abs/2301.10743" rel="nofollow noreferrer">https://arxiv.org/abs/2301.10743</a>. I find it interesting that transformer encoders turn out to correspond, roughly, to first-order logic with counting quantifiers - a logic that sits inside small circuit classes. Amazing what you can do even if you're not Turing complete!
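To make the "counting" part concrete, here's a toy of my own (not the paper's construction): a head that attends uniformly over all positions computes an average, i.e. the fraction of positions carrying some feature, and thresholding that fraction decides counting-style predicates such as MAJORITY:

    import numpy as np

    def uniform_attention(values):
        n = len(values)
        weights = np.full(n, 1.0 / n)     # uniform attention pattern over all positions
        return weights @ values           # average of the values = fraction of 1s

    bits = np.array([1, 0, 1, 1, 0, 1], dtype=float)   # per-token indicator feature
    print(uniform_attention(bits) > 0.5)               # True: a MAJORITY of tokens carry the feature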
To quote someone: RASP is like Matlab, designed by Satan.<p>There's an interpreter for a RASP-like language if you want to try it out: <a href="https://srush.github.io/raspy/" rel="nofollow noreferrer">https://srush.github.io/raspy/</a><p>And DeepMind published Tracr, a compiler from RASP programs to transformer weights: <a href="https://github.com/deepmind/tracr">https://github.com/deepmind/tracr</a>
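If you don't want to install anything, the two core RASP primitives are easy to mock up in numpy (the real RASPy API differs a bit; this is just to convey the idea): select builds a boolean attention pattern from a predicate, and aggregate averages values under it. The classic exercise is reversing a sequence:

    import numpy as np

    def select(keys, queries, predicate):
        n = len(keys)
        return np.array([[predicate(keys[j], queries[i]) for j in range(n)]
                         for i in range(n)], dtype=float)

    def aggregate(selector, values):
        weights = selector / selector.sum(axis=1, keepdims=True)   # normalize each row
        return weights @ np.asarray(values, dtype=float)

    tokens = np.array([10, 20, 30, 40], dtype=float)
    idx = np.arange(len(tokens))
    sel = select(keys=idx, queries=len(tokens) - 1 - idx,
                 predicate=lambda k, q: k == q)        # position i attends to position n-1-i
    print(aggregate(sel, tokens))                      # [40. 30. 20. 10.]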
This is cool, but I think a more fundamental primitive is the probability distribution over next tokens and how it changes with each layer's computation.
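That's roughly the "logit lens" trick: project the residual stream after every layer through the model's unembedding and watch the next-token distribution sharpen. A quick sketch with GPT-2 and Hugging Face transformers (the standard logit-lens approximation, not an exact account of what each layer "believes"):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)

    # out.hidden_states: the embedding output plus the residual stream after each block
    for layer, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))   # project residual to vocab
        top = logits.softmax(-1).topk(1)
        print(layer, repr(tokenizer.decode(top.indices)), round(float(top.values), 3))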