> FFMA SASS interleaving
>
> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess: yielding the current warp and letting other warps work).
>
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.

I would say it is really mind-blowing.
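To make the trick concrete, here is a minimal sketch of the binary-patching idea, not DeepSeek's actual script. It assumes SM90 instructions are 128-bit (two little-endian 64-bit words) and that the yield and reuse flags sit at the bit positions below in the high word; those positions come from open-source SASS reverse-engineering projects (maxas-style control encodings), not NVIDIA documentation, and may be wrong for your architecture.

```python
# Hypothetical sketch: flip scheduling-control bits on FFMA instructions
# inside a cubin's .text section. Bit positions are ASSUMPTIONS taken from
# community SASS assemblers, not official docs.
import struct

YIELD_BIT = 1 << 45  # assumed: yield hint (bit 109 of the 128-bit word)
REUSE_BIT = 1 << 58  # assumed: first register-reuse cache flag (bit 122)

def interleave_ffma(code: bytearray, ffma_offsets: list[int]) -> None:
    """Flip the yield bit (and clear reuse) on every other FFMA.

    `code` is the raw instruction bytes of a cubin text section;
    `ffma_offsets` are byte offsets of FFMA instructions, e.g. recovered
    by parsing `cuobjdump -sass` output.
    """
    for i, off in enumerate(ffma_offsets):
        if i % 2:            # mimic the interleaving pattern: every other FFMA
            continue
        lo, hi = struct.unpack_from("<QQ", code, off)
        hi ^= YIELD_BIT      # flip the yield hint, as the quoted note describes
        hi &= ~REUSE_BIT     # clear reuse: a yielded warp cannot reuse operands
        struct.pack_into("<QQ", code, off, lo, hi)
```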
This kind of stuff is an interesting demonstration of how far compilers still are from extracting peak performance out of hardware from high-level code.

What would it take for traditional compiler tech or AI-assisted optimization agents to come up with something like this?
The speedup figures they report are compared against their own CUTLASS-based baseline. Has anyone done a performance comparison against cuBLAS?

All CUTLASS results I have seen so far for GEMM are within ~10% of cuBLAS. If the 2x-2.5x speedup they report holds up, that would be extremely impressive.
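For a rough reference point (my own sketch, not from the repo): time `torch.matmul`, which dispatches to cuBLAS, at the same (M, N, K) and compare achieved TFLOPS against the reported FP8 numbers. It is not apples-to-apples since the dtypes differ, but it gives a ballpark; a true comparison would need cuBLASLt's FP8 path or DeepGEMM's own API.

```python
# Simple CUDA-event timing harness for an achieved-TFLOPS comparison.
import torch

def bench_tflops(fn, m, n, k, iters=100):
    for _ in range(10):          # warm-up to exclude JIT/launch setup
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return 2 * m * n * k / (ms * 1e-3) / 1e12   # FLOPs / seconds / 1e12

m, n, k = 4096, 4096, 4096
a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
print(f"cuBLAS BF16: {bench_tflops(lambda: a @ b, m, n, k):.1f} TFLOPS")
```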
I think this kind of open-source release really shows that their objective is efficiency across the industry. This sort of software mostly benefits the big players serving the model (competitors to DeepSeek themselves, if they are interested in being a provider), rather than the general open-source community that wants to learn, tinker, or serve models on consumer hardware.
I'm not sure the push toward lower and lower precision is a good idea long term. It suggests that the models are really sparse. That may be true right now, but I suspect it is only because we have some bad ideas about how to train them, not because they really should be that sparse.
This might be rendered moot by native microscaling support in Blackwell (MXFP). They've manually done a coarser-grained version of that for Hopper, but with full FP32 scaling factors.
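To illustrate the difference (my own sketch, not DeepGEMM's code): the coarser-grained Hopper approach quantizes each block of, say, 128 values to FP8 with one float32 scale per block, while MXFP on Blackwell applies the same idea at block size 32 with the scale restricted to a power of two (E8M0) and handled in hardware.

```python
# Blockwise FP8 quantization with full FP32 scale factors (illustrative).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quant_blockwise_fp8(x: torch.Tensor, block: int = 128):
    """x: [m, k] with k divisible by `block`. Returns (fp8 data, fp32 scales)."""
    m, k = x.shape
    xb = x.reshape(m, k // block, block)
    # One scale per 1x`block` slice, chosen so the block's max maps to FP8_MAX.
    scales = xb.abs().amax(dim=-1).clamp(min=1e-12) / FP8_MAX   # [m, k/block]
    q = (xb / scales.unsqueeze(-1)).to(torch.float8_e4m3fn)
    return q.reshape(m, k), scales.float()
```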
I keep wondering why there even are undocumented instructions.

Wouldn't it make sense to expose these to the user, even if they might not be perfectly reliable? This stuff must be documented internally, so why not just release it?

Security by obscurity does not work: your competitors reverse engineer everything you do anyway.
Honestly, this is beyond my usage and understanding. But I really appreciate that they share findings and improvements like this so that everyone can benefit from them. It's refreshing.
It seems mostly Python, which is nice: https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/jit_kernels/gemm.py
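Based on my reading of the linked repo's README, usage looks roughly like the sketch below; the function name, tuple-of-(tensor, scales) convention, and scale granularities are my assumptions and may not match the current API exactly.

```python
# Assumed DeepGEMM usage: FP8 inputs with per-block FP32 scales, BF16 output.
import torch
import deep_gemm  # assumed module name from the repo

m, k, n = 4096, 7168, 4096
# LHS: 1x128 scale blocks; RHS: 128x128 scale blocks (assumed layout).
lhs = (torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn),
       torch.ones(m, k // 128, device="cuda", dtype=torch.float32))
rhs = (torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn),
       torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32))
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)
deep_gemm.gemm_fp8_fp8_bf16_nt(lhs, rhs, out)  # kernel is JIT-compiled on first call
```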