> FFMA SASS interleaving
>
> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess: yielding the current warp and letting other warps work).
>
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.

I would say it is really mind-blowing.
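To make the trick concrete, here is a minimal sketch of the binary-patching idea, not DeepSeek's actual script. It assumes SM90 instructions are 128-bit (two little-endian 64-bit words) and that the yield and reuse flags sit at the bit positions below in the high word; those positions come from open-source SASS reverse-engineering projects (maxas-style control encodings), not NVIDIA documentation, and may be wrong for your architecture.

```python
# Hypothetical sketch: flip scheduling-control bits on FFMA instructions
# inside a cubin's .text section. Bit positions are ASSUMPTIONS taken from
# community SASS assemblers, not official docs.
import struct

YIELD_BIT = 1 << 45  # assumed: yield hint (bit 109 of the 128-bit word)
REUSE_BIT = 1 << 58  # assumed: first register-reuse cache flag (bit 122)

def interleave_ffma(code: bytearray, ffma_offsets: list[int]) -> None:
    """Flip the yield bit (and clear reuse) on every other FFMA.

    `code` is the raw instruction bytes of a cubin text section;
    `ffma_offsets` are byte offsets of FFMA instructions, e.g. recovered
    by parsing `cuobjdump -sass` output.
    """
    for i, off in enumerate(ffma_offsets):
        if i % 2:            # mimic the interleaving pattern: every other FFMA
            continue
        lo, hi = struct.unpack_from("<QQ", code, off)
        hi ^= YIELD_BIT      # flip the yield hint, as the quoted note describes
        hi &= ~REUSE_BIT     # clear reuse: a yielded warp cannot reuse operands
        struct.pack_into("<QQ", code, off, lo, hi)
```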
This kind of stuff is an interesting demonstration of how far compilers still are from extracting peak performance out of hardware from high-level code.

What would it take for traditional compiler tech or AI-assisted optimization agents to come up with something like this?
The speedup figures they report are compared against their own CUTLASS-based baseline. Has anyone done a performance comparison against cuBLAS?

All CUTLASS results I have seen so far for GEMM are within ~10% of cuBLAS. If the 2x-2.5x speedup they report holds up, that would be extremely impressive.
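For a rough reference point (my own sketch, not from the repo): time `torch.matmul`, which dispatches to cuBLAS, at the same (M, N, K) and compare achieved TFLOPS against the reported FP8 numbers. It is not apples-to-apples since the dtypes differ, but it gives a ballpark; a true comparison would need cuBLASLt's FP8 path or DeepGEMM's own API.

```python
# Simple CUDA-event timing harness for an achieved-TFLOPS comparison.
import torch

def bench_tflops(fn, m, n, k, iters=100):
    for _ in range(10):          # warm-up to exclude JIT/launch setup
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    return 2 * m * n * k / (ms * 1e-3) / 1e12   # FLOPs / seconds / 1e12

m, n, k = 4096, 4096, 4096
a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
print(f"cuBLAS BF16: {bench_tflops(lambda: a @ b, m, n, k):.1f} TFLOPS")
```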
I think this kind of open-source release really shows that their objective is efficiency across the industry. This sort of software mostly benefits the big players serving the model (competitors to DeepSeek themselves, if they are interested in being a provider), rather than the general open-source community that wants to learn, tinker, or serve models on consumer hardware.
I'm not sure the push toward lower and lower precision is a good idea long term. It suggests that the models are really sparse. That may be true right now, but I suspect it is only because we have some bad ideas about how to train them, not because they really should be that sparse.
This might be rendered moot by native microscaling support in Blackwell (MXFP). They've manually done a coarser-grained version of that for Hopper, but with full FP32 scaling factors.
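To illustrate the difference (my own sketch, not DeepGEMM's code): the coarser-grained Hopper approach quantizes each block of, say, 128 values to FP8 with one float32 scale per block, while MXFP on Blackwell applies the same idea at block size 32 with the scale restricted to a power of two (E8M0) and handled in hardware.

```python
# Blockwise FP8 quantization with full FP32 scale factors (illustrative).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quant_blockwise_fp8(x: torch.Tensor, block: int = 128):
    """x: [m, k] with k divisible by `block`. Returns (fp8 data, fp32 scales)."""
    m, k = x.shape
    xb = x.reshape(m, k // block, block)
    # One scale per 1x`block` slice, chosen so the block's max maps to FP8_MAX.
    scales = xb.abs().amax(dim=-1).clamp(min=1e-12) / FP8_MAX   # [m, k/block]
    q = (xb / scales.unsqueeze(-1)).to(torch.float8_e4m3fn)
    return q.reshape(m, k), scales.float()
```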
I keep wondering why there even are undocumented instructions.

Wouldn't it make sense to expose these to the user, even if they might not be perfectly reliable? This stuff must be documented internally, so why not just release it?

Security by obscurity does not work: your competitors reverse engineer everything you do anyway.
Honestly, this is beyond my usage and understanding. But I really appreciate that they share findings and improvements like this so that everyone can benefit from them. It's refreshing.
It seems mostly Python, which is nice: https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/jit_kernels/gemm.py
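Based on my reading of the linked repo's README, usage looks roughly like the sketch below; the function name, tuple-of-(tensor, scales) convention, and scale granularities are my assumptions and may not match the current API exactly.

```python
# Assumed DeepGEMM usage: FP8 inputs with per-block FP32 scales, BF16 output.
import torch
import deep_gemm  # assumed module name from the repo

m, k, n = 4096, 7168, 4096
# LHS: 1x128 scale blocks; RHS: 128x128 scale blocks (assumed layout).
lhs = (torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn),
       torch.ones(m, k // 128, device="cuda", dtype=torch.float32))
rhs = (torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn),
       torch.ones(n // 128, k // 128, device="cuda", dtype=torch.float32))
out = torch.empty(m, n, device="cuda", dtype=torch.bfloat16)
deep_gemm.gemm_fp8_fp8_bf16_nt(lhs, rhs, out)  # kernel is JIT-compiled on first call
```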