This is super cool! Especially matrix mult getting similar or better perf than cuBLAS! If anyone is interested in other kernels like SwiGLU, GeGLU, and RMS LayerNorm, I coded some at <a href="https://github.com/unslothai/unsloth/tree/main/unsloth/kernels">https://github.com/unslothai/unsloth/tree/main/unsloth/kerne...</a>