As mentioned, these are all toy implementations and you should not use them in production. If you want the fast, easy, and extremely optimized way of doing things, use torch.nn.MultiheadAttention or torch.nn.functional.scaled_dot_product_attention so that you get the optimal implementations. You can use xformers' scaled dot product attention if you want the bleeding edge of performance.

> (Note that the code presented in this article is intended for illustrative purposes. If you plan to implement self-attention for training LLMs, I recommend considering optimized implementations like Flash Attention, which reduce memory footprint and computational load.)

Flash Attention is already part of torch's kernels as of torch 2, but the latest versions and optimizations land in xformers first.
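For reference, here is a minimal sketch of calling the fused kernel through torch.nn.functional.scaled_dot_product_attention. The tensor shapes and the is_causal flag are just illustrative assumptions; PyTorch selects a backend (Flash Attention, memory-efficient attention, or the plain math fallback) based on device, dtype, and input shapes.

```python
import torch
import torch.nn.functional as F

# Toy shapes for illustration: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# is_causal=True applies the usual decoder-style causal mask;
# the fused Flash Attention backend is only used when hardware/dtype allow it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```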