As mentioned, these are all toy implementations and you should not use them in production. If you want the fast, easy, and heavily optimized way of doing things, use torch.nn.MultiheadAttention or torch.nn.functional.scaled_dot_product_attention so that you get the optimized implementations. You can use xformers' scaled dot product attention if you want the bleeding edge of performance.

> (Note that the code presented in this article is intended for illustrative purposes. If you plan to implement self-attention for training LLMs, I recommend considering optimized implementations like Flash Attention, which reduce memory footprint and computational load.)

Flash attention is already part of torch's kernels as of torch 2, but the latest versions and optimizations land in xformers first.
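For concreteness, here's a minimal sketch of both built-in paths; the shapes and hyperparameters below are made up purely for illustration:

<pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up shapes for illustration only.
batch, seq_len, embed_dim, num_heads = 2, 16, 64, 4
x = torch.randn(batch, seq_len, embed_dim)

# Option 1: the full module, which handles the Q/K/V projections for you.
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, _ = mha(x, x, x, need_weights=False)  # out: (batch, seq_len, embed_dim)

# Option 2: the functional kernel, if you do the projections and head-splitting
# yourself. Expected shape: (batch, num_heads, seq_len, head_dim). PyTorch
# dispatches to a fused kernel (e.g. Flash Attention) when inputs and hardware allow.
head_dim = embed_dim // num_heads
q = k = v = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
out2 = F.scaled_dot_product_attention(q, k, v, is_causal=True)
</code></pre>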
<pre><code> conscious, kŏn′shəs, adjective -- Characterized by or having an awareness of one's environment and one's own existence, sensations, and thoughts. synonym: aware.
</code></pre>
Self-attention seems to be at least a proxy for "awareness of ... one's own existence." If that closed loop is the thing that converts sensibility into sentience, then maybe it's the source of LLMs' leverage too. Is this language-comprehension algorithm a sort of consciousness algorithm?