> In our experiments with Transformers, we observed that not all the attention heads utilize their attention span to the fullest. In fact, in a task of character-level language modeling, most of the heads were using only a small portion of their attention span. If we can take advantage of this property during training, we can reduce the computation time and memory footprint significantly.
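
To make the idea concrete, here is a minimal sketch of how a per-head span limit could be learned and applied to attention weights. It assumes a PyTorch setting; the module name, the `max_span` and `ramp` hyper-parameters, and the simplified distance computation are illustrative assumptions, not taken from the quoted text.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask letting each attention head learn how far back it attends.

    Heads whose learned span stays small only need the most recent key/value
    pairs, which is where the compute and memory savings would come from.
    """

    def __init__(self, n_heads: int, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # One learnable span fraction per head, initialised to half the maximum.
        self.span = nn.Parameter(torch.full((n_heads, 1, 1), 0.5))

    def forward(self, attn_weights: torch.Tensor) -> torch.Tensor:
        # attn_weights: (batch, n_heads, query_len, key_len), post-softmax.
        key_len = attn_weights.size(-1)
        # Distance of each key from the current position (0 = most recent key).
        # A real implementation would compute per-query distances; this sketch
        # treats the last key as the current position for simplicity.
        distance = torch.arange(
            key_len - 1, -1, -1, device=attn_weights.device, dtype=attn_weights.dtype
        )
        span = self.span.clamp(0, 1) * self.max_span
        # Soft mask: 1 inside the learned span, linear ramp down to 0 beyond it.
        mask = ((span - distance) / self.ramp + 1.0).clamp(0, 1)
        masked = attn_weights * mask
        # Renormalise so each query's attention weights still sum to 1.
        return masked / (masked.sum(dim=-1, keepdim=True) + 1e-8)
```

Because the mask is differentiable in the span parameter, the spans can be trained jointly with the rest of the model (typically with a small penalty on span size), and heads that never benefit from a long context shrink their span on their own.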