I find the paper's distinction between encoder-decoder, encoder-only, and decoder-only Transformers very useful for my informal understanding of the different architectures. Thank you for the clear explanation.
I can't tell who this paper is aimed at. It isn't formal. It isn't mathematical. It isn't a good description and doesn't have good coverage. I can only assume it was written to attract citations.