TechEcho
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

391 points by mfiguiere, 3 months ago

13 comments

Bimos · 3 months ago

> FFMA SASS interleaving
>
> We observe a performance improvement in the CUTLASS FP8 kernel between NVCC 12.2 and 12.3. By comparing the compiled SASS, we discover that one bit in a series of FADD instructions is flipped in an interleaving pattern. After referencing some open-source CUDA assembler implementations, we identified that this bit controls yield, which may enhance warp-level parallelism (just a guess: yielding the current warp and letting other warps work).
>
> To leverage this, we develop a similar script to modify the FFMA instructions in the compiled binary. Besides simply modifying the yield bit, we also flip the reuse bit (registers cannot be reused if the warp is yielded). This adjustment improves performance (10%+ in some cases) for fine-grained scaling FP8 GEMMs by creating more opportunities to overlap MMA instructions with promotion FFMA instructions.

I would say it is really mind-blowing.
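The bit-patching trick quoted above can be sketched in pure Python. This is an illustration only, not DeepGEMM's actual script: real SASS encodings are undocumented, so the opcode mask and the yield/reuse bit positions below are hypothetical placeholders, chosen just to show the interleaving logic.

```python
# Illustrative sketch of FFMA interleaving on 128-bit instruction words.
# All encoding constants are HYPOTHETICAL; Hopper's real SASS layout is
# undocumented and differs from these placeholders.

FFMA_OPCODE_MASK = 0xFFF       # hypothetical: low 12 bits identify the opcode
FFMA_OPCODE      = 0x223       # hypothetical FFMA opcode value
YIELD_BIT        = 1 << 109    # hypothetical position of the yield control bit
REUSE_BIT        = 1 << 58     # hypothetical position of a register-reuse bit

def interleave_ffma(instrs):
    """Flip the yield bit on every other FFMA in the stream, clearing the
    reuse bit on yielded instructions (a yielded warp cannot reuse its
    operand registers). Returns a new list of instruction words."""
    out = []
    ffma_index = 0
    for word in instrs:
        if (word & FFMA_OPCODE_MASK) == FFMA_OPCODE:
            if ffma_index % 2 == 1:      # interleave: odd-numbered FFMAs yield
                word |= YIELD_BIT
                word &= ~REUSE_BIT
            ffma_index += 1
        out.append(word)
    return out
```

The real tool works on the compiled cubin after `nvcc` runs, which is why a Python post-processing script can do this at all: the change is a pure binary rewrite, with no compiler support needed.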
fulafel · 3 months ago

This kind of stuff is an interesting demonstration of how far compilers are from extracting high performance from hardware based on high-level code.

What would it take for traditional compiler tech or AI-assisted optimization agents to come up with something like this?
shihab · 3 months ago

The speedup figures they report are compared to their own CUTLASS-based baseline. Has anyone done a performance comparison against cuBLAS?

All CUTLASS results I have seen so far for GEMM are within ~10% of cuBLAS. If the 2x-2.5x speedup they report holds up, that would be extremely impressive.
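Anyone wanting to run that comparison themselves needs a careful timing harness. A minimal, library-agnostic sketch in pure Python is below; the two kernel callables are assumed to be supplied by the reader (e.g. a deep_gemm call and a cuBLAS-backed matmul), and for real GPU kernels you must also synchronize the device (e.g. `torch.cuda.synchronize()`) before reading the clock, which this CPU-side sketch omits.

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Time a kernel callable: run warmup iterations first (to absorb JIT
    compilation and cache effects), then return the best of `iters` timed
    runs in seconds. Best-of is less noisy than mean for short kernels."""
    for _ in range(warmup):
        fn(*args)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

def speedup(baseline_s, candidate_s):
    """Speedup of candidate over baseline; > 1.0 means candidate is faster."""
    return baseline_s / candidate_s
```

Usage would be `speedup(benchmark(cublas_gemm, a, b), benchmark(deep_gemm_fn, a, b))`, with both callables operating on the same pre-allocated device tensors so allocation cost is excluded.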
WiSaGaN · 3 months ago

I think this kind of open-sourcing really shows that their objective is industry-wide efficiency. This kind of software mostly benefits the big players serving the model (competitors to DeepSeek itself, if they are interested in being a provider) rather than the general open-source community that wants to learn and tinker, or serve models on consumer hardware.
jmward01 · 3 months ago

I'm not sure the push toward lower and lower precision is a good idea long term. It suggests that models are really sparse. That may be true right now, but I think that is likely just because we have some bad ideas about how to train them, not because they really should be that sparse.
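For context on how low precision stays usable at all: the "fine-grained scaling" in the title means keeping a separate FP32 scale factor per small block of values rather than one per tensor, so a single outlier only wrecks the range of its own block. A simplified pure-Python sketch of the idea follows; `E4M3_MAX = 448.0` is the real maximum finite value of the FP8 E4M3 format, and the block size of 128 matches the per-128-channel scaling discussed upthread, but the rounding to actual FP8 bit patterns is omitted here, so this only illustrates the scaling bookkeeping.

```python
E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
BLOCK = 128        # per-128-element scaling granularity

def quantize_blockwise(xs, block=BLOCK):
    """Split xs into blocks and scale each block so its max |value| maps
    to E4M3_MAX. Returns (scaled blocks, per-block FP32 scale factors).
    A real kernel would additionally round each scaled value to FP8."""
    scaled, scales = [], []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(v) for v in chunk) or 1.0  # avoid dividing by zero
        scale = amax / E4M3_MAX
        scales.append(scale)
        scaled.append([v / scale for v in chunk])
    return scaled, scales

def dequantize_blockwise(scaled, scales):
    """Invert the per-block scaling back to the original range."""
    out = []
    for chunk, scale in zip(scaled, scales):
        out.extend(v * scale for v in chunk)
    return out
```

The GEMM-side consequence is the "promotion FFMA" traffic mentioned in the top comment: partial products accumulated in low precision must be rescaled and added into an FP32 accumulator block by block, which is exactly the work the FFMA interleaving trick overlaps with the MMA instructions.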
nbonaparte · 3 months ago

This might be rendered moot by native microscaling support in Blackwell (MXFP). They've manually done a coarser-grained version of that for Hopper, but with full FP32 scaling factors.
alecco · 3 months ago

Wow, MIT license. I hope some big players embrace this open-source cooperative approach.
niemandhier · 3 months ago

I keep wondering why there even are undocumented instructions.

Wouldn't it make sense to provide these to the user, even if they might not be perfectly reliable?

This stuff must be documented internally; why not just release it?

Security by obscurity does not work: your competitors reverse-engineer everything you do anyway.
dr_kretyn · 3 months ago

Honestly, this is beyond my usage and understanding. But I really appreciate this kind of sharing of findings and improvements so that everyone can benefit from them. It's refreshing.
m3kw9 · 3 months ago

The $20 question: what can I do with this?
buyucu · 3 months ago

These guys are on fire! Seriously, kudos to the DeepSeek team.
cde-v · 3 months ago

Interesting timing, with NVDA releasing results tomorrow.
fulafel · 3 months ago

It seems mostly Python, which is nice: https://github.com/deepseek-ai/DeepGEMM/blob/main/deep_gemm/jit_kernels/gemm.py