Using two attention layers with √N inputs to cover a context of size N = √N × √N is somewhat intuitive for image data, since the decomposition corresponds to rows and columns.

But it's quite surprising that this also works for text data, and especially that the fixed pattern performs better than the strided one, despite there not being anything analogous to image boundaries in the data.

It'd also be interesting to see what happens with other decompositions, such as 3 layers of ∛N or a logarithmic stack of dilated convolutions.
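For reference, here's a rough sketch (my own toy code, not the paper's implementation) of what I understand the two factorized patterns to be, for a small sequence of N = 16 with stride l = √N = 4; the function names and the choice of c = 1 summary position per block are mine:

```python
def strided_pattern(i, l):
    """Head 1: the previous l positions; head 2: every l-th earlier position."""
    head1 = {j for j in range(i + 1) if i - j < l}          # local window
    head2 = {j for j in range(i + 1) if (i - j) % l == 0}   # strided "column"
    return head1, head2

def fixed_pattern(i, l, c=1):
    """Head 1: positions in the same block of length l; head 2: the last c
    positions of every block (fixed summary columns shared by all queries)."""
    head1 = {j for j in range(i + 1) if j // l == i // l}   # current block
    head2 = {j for j in range(i + 1) if j % l >= l - c}     # summary positions
    return head1, head2

if __name__ == "__main__":
    N, l = 16, 4  # N = sqrt(N) * sqrt(N): each head attends to O(sqrt(N)) keys
    for i in (5, 13):
        print(f"query {i}: strided={strided_pattern(i, l)} fixed={fixed_pattern(i, l)}")
```

Either way, each head only looks at O(√N) positions, and stacking the two lets information flow between any pair of positions in two hops, which is the part that generalizes to the ∛N-style decompositions mentioned above.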