Late to the party, but I think my summary is (L is context length, C is hidden dimension, H is head size, C = H * nh):

3.1 Optimised attention: Instead of using a learned W_V to project from C to H, slice V into H-sized vectors (V is just the input tokens X). The matmul projects down to a lower dimension anyway, so why not just slice. Slicing is just reshaping (L, C) -> (L, nh, H).

3.2 Efficient attention: I think this opens with a typo: "In the last section, we discussed how and why we can remove W_O..." should be W_V, not W_O. Anyway, same as above, just for the keys this time. Reshape K (which is again just X) from (L, C) -> (L, nh, H).

3.3 Super attention: Introduce an (L, L) matrix W_A (lower triangular for masked attention) that transforms V (X again) on the left, (L, C) -> (L, C), whereas standard attention has a (C, C) W_V that transforms (L, C) -> (L, C) from the right. W_A is shared between heads. It's only more efficient when C > L (W_A is L x L versus C x C for W_V), so for long-context models it's probably not a win.

I think the first two modifications are equivalent to just setting W_V and W_K to constant identity matrices, right? That makes me wonder what would happen if you instead restricted W_V (and/or W_K, W_Q) to be block diagonal, so that each head in effect has an (H, H) matrix transforming only the slice of X it receives (instead of the full, non-square (C, H) projection each head normally gets). That's different from standard attention, right? Because there each head's W_V acts over the full C dimension. Almost surely someone has thought of this, so I'll try to find out (rough sketches at the end of this comment).

Still learning, so all this could be wrong.
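
To convince myself of the "slice instead of project" point in 3.1/3.2, here's a toy numpy check (shapes and names like nh, H are mine, not the article's code): slicing X into heads gives exactly what a standard value projection with W_V = I would give.

    import numpy as np

    L, nh, H = 6, 4, 8          # context length, number of heads, head size
    C = nh * H                  # hidden dimension

    X = np.random.randn(L, C)   # token representations (V and K are just X here)

    # 3.1 / 3.2: slice V (= X) into heads by reshaping (L, C) -> (L, nh, H)
    V_sliced = X.reshape(L, nh, H)

    # Standard-attention style: project with W_V, then split into heads.
    # With W_V = identity the projection is a no-op, so the two must agree.
    W_V = np.eye(C)
    V_proj = (X @ W_V).reshape(L, nh, H)

    assert np.allclose(V_sliced, V_proj)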
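
And a shape-only sketch of 3.3, again with made-up toy sizes: W_A multiplies from the left and mixes token positions, W_V from the right and mixes channels, so the parameter comparison is just L*L versus C*C.

    import numpy as np

    L, C = 6, 32
    X = np.random.randn(L, C)

    W_A = np.tril(np.random.randn(L, L))   # lower triangular for the masked/causal case
    W_V = np.random.randn(C, C)

    left = W_A @ X      # (L, L) @ (L, C) -> (L, C), mixes across token positions
    right = X @ W_V     # (L, C) @ (C, C) -> (L, C), mixes across channels

    print(W_A.size, W_V.size)   # L*L vs C*C parameters: W_A only wins when L < C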
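
Finally, the block-diagonal idea I was asking about, as a sketch: giving each head its own (H, H) matrix on its slice is the same as one (C, C) block-diagonal W_V, which is less expressive than standard attention's dense per-head (C, H) projections.

    import numpy as np

    L, nh, H = 6, 4, 8
    C = nh * H

    X = np.random.randn(L, C)
    blocks = [np.random.randn(H, H) for _ in range(nh)]   # one (H, H) matrix per head

    # Per-head view: each block transforms only its own slice of X
    X_heads = X.reshape(L, nh, H)
    per_head = np.stack([X_heads[:, h] @ blocks[h] for h in range(nh)], axis=1)  # (L, nh, H)

    # Equivalent single (C, C) block-diagonal W_V
    W_V = np.zeros((C, C))
    for h, B in enumerate(blocks):
        W_V[h * H:(h + 1) * H, h * H:(h + 1) * H] = B
    full = (X @ W_V).reshape(L, nh, H)

    assert np.allclose(per_head, full)
    # Standard multi-head attention instead gives each head a dense (C, H) slice
    # of W_V, so every head can mix information from the full C dimension.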