科技回声 (Tech Echo) — a tech news platform built with Next.js, mirroring Hacker News stories and discussion.

Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022)

74 points | by georgehill | 10 months ago

6 comments

ap4 · 10 months ago
Related: I created a CUDA kernel that is typically much faster than the kernels from cuBLAS when multiplying large square float32 matrices. Tested mostly on a 4090 GPU so far.

Source code: https://github.com/arekpaterek/Faster_SGEMM_CUDA

    size    tflops_cublas  tflops_my  diff     gpu
    4096²   50.8-50.9      61.8       +21%     4090
    6144²   55.3           59.8       +8%      4090
    8192²   56.3-56.5      67.1       +19%     4090
    12288²  53.7           66.7       +24%     4090
    16384²  53.6           66.7       +24%     4090
    4096²   28.7-28.8      32.5       +13%     4070ts
    4096²   3.8-4.3        6.7        +56-76%  T4
hedgehog · 10 months ago
For those interested in going deeper, I think the classic reference in this area is the GotoBLAS paper: https://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/gotoPaper.pdf
namibj · 10 months ago
The quoted assembly looks L1D$-bandwidth-bound. On most common and vaguely recent architectures you have to use register tiling to saturate the FMA units: a core that can issue at most one vector load and one vector store per cycle can never fully saturate even a single FMA unit on GEMM, and with 2 FMA units, even 2 vector loads and a vector store per cycle are not enough without register tiling.
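A minimal scalar sketch of the register-tiling idea described above (the function name and the 4×4 tile shape are illustrative choices, not from the article; a production kernel would hold the tile in SIMD registers and use FMA intrinsics with larger tiles). The point is the arithmetic-to-load ratio: each inner-loop iteration does 16 multiply-adds against only 8 loads, versus one multiply-add per two loads in a naive inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* 4x4 register-tiled micro-kernel: C[0..3][0..3] += A[0..3][0..K-1] * B[0..K-1][0..3].
 * A, B, C are row-major with leading dimensions lda, ldb, ldc.
 * The 16 accumulators stay in registers for the whole k-loop; each
 * iteration loads 4 elements of A and 4 of B but performs 16
 * multiply-adds, a 2:1 FMA-to-load ratio instead of the naive 1:2. */
static void kernel_4x4(size_t K, const float *A, size_t lda,
                       const float *B, size_t ldb,
                       float *C, size_t ldc) {
    float c[4][4] = {{0.0f}};
    for (size_t k = 0; k < K; k++) {
        /* 8 loads... */
        float a0 = A[0*lda+k], a1 = A[1*lda+k], a2 = A[2*lda+k], a3 = A[3*lda+k];
        float b0 = B[k*ldb+0], b1 = B[k*ldb+1], b2 = B[k*ldb+2], b3 = B[k*ldb+3];
        /* ...feed 16 multiply-adds */
        c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;
        c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
        c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
        c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
    }
    /* write the finished tile back to memory once */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[i*ldc+j] += c[i][j];
}
```

A full GEMM would sweep this kernel over all 4×4 tiles of C; widening the tile raises the reuse ratio further until the register file runs out.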
canjobear · 10 months ago
Quasi-related: Do BLAS libraries ever actually implement Strassen's Algorithm?
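For context, a sketch of what Strassen's algorithm looks like (an illustrative recursive implementation for power-of-two sizes, not taken from any BLAS library): it trades one of the eight half-size recursive multiplications for extra additions, giving O(n^2.807) asymptotics. Mainstream BLAS libraries generally avoid it, often citing its weaker componentwise error bounds and the extra workspace and memory traffic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Helpers: elementwise Z = X + Y and Z = X - Y on contiguous n*n matrices. */
static void madd(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
}
static void msub(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
}

/* Strassen multiply C = A*B for n a power of two (row-major, contiguous).
 * Below a cutoff it falls back to a naive triple loop. */
static void strassen(int n, const float *A, const float *B, float *C) {
    if (n <= 64) {
        memset(C, 0, sizeof(float) * (size_t)n * n);
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*n+j] += A[i*n+k] * B[k*n+j];
        return;
    }
    int h = n / 2;
    size_t q = (size_t)h * h;
    /* 8 quadrants + 7 products + 2 temporaries */
    float *buf = malloc(17 * q * sizeof(float));
    float *A11 = buf,       *A12 = buf + q,   *A21 = buf + 2*q, *A22 = buf + 3*q;
    float *B11 = buf + 4*q, *B12 = buf + 5*q, *B21 = buf + 6*q, *B22 = buf + 7*q;
    float *M[7];
    for (int i = 0; i < 7; i++) M[i] = buf + (8 + i) * q;
    float *T1 = buf + 15*q, *T2 = buf + 16*q;
    /* split A and B into quadrants */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            A11[i*h+j] = A[i*n+j];       A12[i*h+j] = A[i*n+j+h];
            A21[i*h+j] = A[(i+h)*n+j];   A22[i*h+j] = A[(i+h)*n+j+h];
            B11[i*h+j] = B[i*n+j];       B12[i*h+j] = B[i*n+j+h];
            B21[i*h+j] = B[(i+h)*n+j];   B22[i*h+j] = B[(i+h)*n+j+h];
        }
    /* 7 recursive products instead of 8 */
    madd(h, A11, A22, T1); madd(h, B11, B22, T2); strassen(h, T1, T2, M[0]);
    madd(h, A21, A22, T1);                        strassen(h, T1, B11, M[1]);
    msub(h, B12, B22, T2);                        strassen(h, A11, T2, M[2]);
    msub(h, B21, B11, T2);                        strassen(h, A22, T2, M[3]);
    madd(h, A11, A12, T1);                        strassen(h, T1, B22, M[4]);
    msub(h, A21, A11, T1); madd(h, B11, B12, T2); strassen(h, T1, T2, M[5]);
    msub(h, A12, A22, T1); madd(h, B21, B22, T2); strassen(h, T1, T2, M[6]);
    /* recombine products into the quadrants of C */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            int s = i*h + j;
            C[i*n+j]       = M[0][s] + M[3][s] - M[4][s] + M[6][s];
            C[i*n+j+h]     = M[2][s] + M[4][s];
            C[(i+h)*n+j]   = M[1][s] + M[3][s];
            C[(i+h)*n+j+h] = M[0][s] - M[1][s] + M[2][s] + M[5][s];
        }
    free(buf);
}
```

In practice the savings only show up for fairly large n, and the additions churn extra memory bandwidth, which is part of why tuned O(n³) kernels usually win.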
Remnant44 · 10 months ago
I honestly didn't realize how performant the decade-old 2013 Haswell architecture is on vector workloads.

250 GFLOP/core is no joke. He also cross-compared against an M1 Pro, which, when not using the secret matrix coprocessor, achieves effectively the same vector throughput a decade later...
kiririn · 10 months ago
The i7-6700 is Skylake, not Haswell.