Using two attention layers with √N inputs to cover a context of size N = √N × √N is somewhat intuitive for image data, since the decomposition corresponds to rows and columns.

But it's quite surprising that this also works for text data, and especially that the fixed pattern performs better than the strided one, despite there not being anything analogous to image boundaries in the data.

It'd also be interesting to see what happens with other decompositions, such as 3 layers of ∛N or a logarithmic stack of dilated convolutions.
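For reference, here's a rough sketch (my own toy code, not the paper's implementation) of what I understand the two factorized patterns to be, for a small sequence of N = 16 with stride l = √N = 4; the function names and the choice of c = 1 summary position per block are mine:

```python
def strided_pattern(i, l):
    """Head 1: the previous l positions; head 2: every l-th earlier position."""
    head1 = {j for j in range(i + 1) if i - j < l}          # local window
    head2 = {j for j in range(i + 1) if (i - j) % l == 0}   # strided "column"
    return head1, head2

def fixed_pattern(i, l, c=1):
    """Head 1: positions in the same block of length l; head 2: the last c
    positions of every block (fixed summary columns shared by all queries)."""
    head1 = {j for j in range(i + 1) if j // l == i // l}   # current block
    head2 = {j for j in range(i + 1) if j % l >= l - c}     # summary positions
    return head1, head2

if __name__ == "__main__":
    N, l = 16, 4  # N = sqrt(N) * sqrt(N): each head attends to O(sqrt(N)) keys
    for i in (5, 13):
        print(f"query {i}: strided={strided_pattern(i, l)} fixed={fixed_pattern(i, l)}")
```

Either way, each head only looks at O(√N) positions, and stacking the two lets information flow between any pair of positions in two hops, which is the part that generalizes to the ∛N-style decompositions mentioned above.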