> In our experiments with Transformers, we observed that not all the attention heads utilize their attention span to the fullest. In fact, in a task of character-level language modeling, most of the heads were using only a small portion of their attention span. If we can take advantage of this property during training, we can reduce the computation time and memory footprint significantly.
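
To make the idea concrete, here is a minimal sketch of how a per-head span limit could be learned and applied to attention weights. It assumes a PyTorch setting; the module name, the `max_span` and `ramp` hyper-parameters, and the simplified distance computation are illustrative assumptions, not taken from the quoted text.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask letting each attention head learn how far back it attends.

    Heads whose learned span stays small only need the most recent key/value
    pairs, which is where the compute and memory savings would come from.
    """

    def __init__(self, n_heads: int, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # One learnable span fraction per head, initialised to half the maximum.
        self.span = nn.Parameter(torch.full((n_heads, 1, 1), 0.5))

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (batch, n_heads, query_len, key_len), post-softmax.
        key_len = attn_weights.size(-1)
        # Distance of each key from the current position (0 = most recent key).
        # A real implementation would compute per-query distances; this sketch
        # treats the last key as the current position for simplicity.
        distance = torch.arange(
            key_len - 1, -1, -1, device=attn_weights.device, dtype=attn_weights.dtype
        )
        span = self.span.clamp(0, 1) * self.max_span
        # Soft mask: 1 inside the learned span, linear ramp down to 0 beyond it.
        mask = ((span - distance) / self.ramp + 1.0).clamp(0, 1)
        masked = attn_weights * mask
        # Renormalise so each query's attention weights still sum to 1.
        return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)
```

Because the mask is differentiable in the span parameter, the spans can be trained jointly with the rest of the model (typically with a small penalty on span size), and heads that never benefit from a long context shrink their span on their own.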