The performance advantages of transformer architectures have little to do with the transformer itself, and more to do with the other architectural changes that were made along the way. They can be matched or beaten by suitably tweaked conventional CNNs: https://arxiv.org/abs/2201.03545
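For the curious, the "modernized" CNN block in that paper (ConvNeXt) boils down to roughly the following. This is a PyTorch sketch with my own naming, and it omits LayerScale and stochastic depth for brevity, so treat it as illustrative rather than the paper's exact code:

    import torch
    import torch.nn as nn

    class ConvNeXtBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # Depthwise conv: groups=dim gives one 7x7 filter per channel
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
            self.norm = nn.LayerNorm(dim)
            # Inverted bottleneck: expand 4x, GELU, project back
            self.pwconv1 = nn.Linear(dim, 4 * dim)
            self.pwconv2 = nn.Linear(4 * dim, dim)

        def forward(self, x):  # x: (N, C, H, W)
            residual = x
            x = self.dwconv(x)
            x = x.permute(0, 2, 3, 1)  # to (N, H, W, C) so norm/linears act per position
            x = self.pwconv2(nn.functional.gelu(self.pwconv1(self.norm(x))))
            x = x.permute(0, 3, 1, 2)  # back to (N, C, H, W)
            return residual + x

The large depthwise kernel plays the same "mix spatially, within each channel" role that self-attention plays in a ViT, which is part of why the gap closes once training recipes are equalized.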
> ViT models are outperforming CNNs in terms of computational efficiency and accuracy, achieving highly competitive performance in tasks like image classification, object detection, and semantic image segmentation.

Since then, this has been shown to be untrue. Using more modern training techniques along with depthwise convolutions (https://arxiv.org/abs/2201.03545) results in equal if not better performance on vision tasks. Improved training methodologies have also been shown to boost the accuracy of ResNet50, a 6-year-old pure convolutional architecture, on ImageNet-1k by over 5% (https://arxiv.org/abs/2110.00476).

Pure ViTs are also more difficult to train than traditional convnets, although this has since been somewhat remedied by hierarchical variants such as Swin (https://arxiv.org/abs/2103.14030).
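Concretely, the "improved training methodologies" in the ResNet50 paper are recipe changes like mixup, CutMix, and stronger augmentation rather than architecture changes. A plain-PyTorch sketch of one such ingredient (illustrative, not that paper's exact recipe):

    import torch
    import torch.nn.functional as F

    def mixup(x, y, num_classes, alpha=0.2):
        # Mixup (arXiv:1710.09412): blend random pairs of examples and their labels
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(x.size(0))
        x_mixed = lam * x + (1 - lam) * x[perm]
        y_soft = F.one_hot(y, num_classes).float()
        y_mixed = lam * y_soft + (1 - lam) * y_soft[perm]
        return x_mixed, y_mixed

    # Label smoothing is built into the loss since PyTorch 1.10; cross_entropy
    # also accepts the soft targets that mixup produces.
    loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

None of this touches the model definition, which is the point: much of the ViT-era accuracy gain came from the recipe, not the architecture.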
> Unlike BERT, GPT models are unidirectional, and their main advantage is the magnitude of data they were pretrained on: GPT-3, the third-generation GPT model, was trained on 175 billion parameters, about 10 times the size of previous models.

Although a Transformer's parameter count tends to scale with the amount of training data, this statement is misleading: 175 billion is the number of parameters, not the amount of training data. The model wasn't trained on parameters; its parameters were trained on data (roughly 300 billion tokens, in GPT-3's case).
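To make the distinction concrete: parameter count is a property of the model itself, independent of how much data it sees during training. A minimal illustration:

    import torch.nn as nn

    # Parameter count comes from the architecture alone...
    model = nn.Linear(1000, 1000)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{n_params:,} parameters")  # 1,001,000 (weights + biases)

    # ...whereas training data is measured separately, e.g. in tokens.
    # GPT-3: ~175e9 parameters, trained on ~300e9 tokens.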