Automated neural architecture search would have eventually found the transformer, and better architectures besides. Reading the paper, the names of the vectors (Q, K, V) seem dressed up to imply that the authors understand exactly how a transformer works, but I am not convinced. It seems to me these folks were simply the first to stumble upon it. They didn’t invent self-attention, either. Right place, right time, tons of free TPUs to experiment with.
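
For reference, "Q/K/V" just names the three learned projections in scaled dot-product attention, softmax(QKᵀ/√d_k)V, as defined in the paper. A minimal numpy sketch of that formula (function and variable names are mine, not the paper's):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention, per Vaswani et al. (2017).

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q  # queries: what each position is looking for
    k = x @ w_k  # keys: what each position offers for matching
    v = x @ w_v  # values: what each position contributes if attended to
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v  # each output is an attention-weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, d_model = 8
w_q, w_k, w_v = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```

Whether you read the query/key/value metaphor as insight or as post-hoc branding of three matrix multiplies is exactly the question.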