科技回声 (Tech Echo) — a tech news platform built with Next.js, mirroring Hacker News stories and discussion.

Fast Multidimensional Matrix Multiplication on CPU from Scratch (2022)

74 points | by georgehill | 10 months ago

6 comments

ap4 · 10 months ago
Related: I created a CUDA kernel that is typically much faster than the kernels from cuBLAS when multiplying large square float32 matrices. Tested mostly on a 4090 GPU so far.

Source code: https://github.com/arekpaterek/Faster_SGEMM_CUDA

    size    tflops_cublas  tflops_my  diff     gpu
    4096²   50.8-50.9      61.8       +21%     4090
    6144²   55.3           59.8       +8%      4090
    8192²   56.3-56.5      67.1       +19%     4090
    12288²  53.7           66.7       +24%     4090
    16384²  53.6           66.7       +24%     4090
    4096²   28.7-28.8      32.5       +13%     4070ts
    4096²   3.8-4.3        6.7        +56-76%  T4
hedgehog · 10 months ago
For those interested in going deeper, I think the classic reference in this area is the GotoBLAS paper: https://www.cs.utexas.edu/~pingali/CS378/2008sp/papers/gotoPaper.pdf
namibj · 10 months ago
The quoted assembly looks L1D$-bandwidth-bound. On most common and vaguely recent architectures you have to use register tiling to saturate the FMA units: a core that can issue at most one vector load and one vector store per cycle can never fully saturate even a single FMA unit on GEMM, and with 2 FMA units, even 2 vector loads and a vector store per cycle are not enough without register tiling.
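A minimal scalar sketch of the register-tiling idea described above (the function name and the 4×4 tile shape are illustrative choices, not from the article; a production kernel would hold the tile in SIMD registers and use FMA intrinsics with larger tiles). The point is the arithmetic-to-load ratio: each inner-loop iteration does 16 multiply-adds against only 8 loads, versus one multiply-add per two loads in a naive inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* 4x4 register-tiled micro-kernel: C[0..3][0..3] += A[0..3][0..K-1] * B[0..K-1][0..3].
 * A, B, C are row-major with leading dimensions lda, ldb, ldc.
 * The 16 accumulators stay in registers for the whole k-loop; each
 * iteration loads 4 elements of A and 4 of B but performs 16
 * multiply-adds, a 2:1 FMA-to-load ratio instead of the naive 1:2. */
static void kernel_4x4(size_t K, const float *A, size_t lda,
                       const float *B, size_t ldb,
                       float *C, size_t ldc) {
    float c[4][4] = {{0.0f}};
    for (size_t k = 0; k < K; k++) {
        /* 8 loads... */
        float a0 = A[0*lda+k], a1 = A[1*lda+k], a2 = A[2*lda+k], a3 = A[3*lda+k];
        float b0 = B[k*ldb+0], b1 = B[k*ldb+1], b2 = B[k*ldb+2], b3 = B[k*ldb+3];
        /* ...feed 16 multiply-adds */
        c[0][0] += a0*b0; c[0][1] += a0*b1; c[0][2] += a0*b2; c[0][3] += a0*b3;
        c[1][0] += a1*b0; c[1][1] += a1*b1; c[1][2] += a1*b2; c[1][3] += a1*b3;
        c[2][0] += a2*b0; c[2][1] += a2*b1; c[2][2] += a2*b2; c[2][3] += a2*b3;
        c[3][0] += a3*b0; c[3][1] += a3*b1; c[3][2] += a3*b2; c[3][3] += a3*b3;
    }
    /* write the finished tile back to memory once */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            C[i*ldc+j] += c[i][j];
}
```

A full GEMM would sweep this kernel over all 4×4 tiles of C; widening the tile raises the reuse ratio further until the register file runs out.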
canjobear · 10 months ago
Quasi-related: Do BLAS libraries ever actually implement Strassen's Algorithm?
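For context, a sketch of what Strassen's algorithm looks like (an illustrative recursive implementation for power-of-two sizes, not taken from any BLAS library): it trades one of the eight half-size recursive multiplications for extra additions, giving O(n^2.807) asymptotics. Mainstream BLAS libraries generally avoid it, often citing its weaker componentwise error bounds and the extra workspace and memory traffic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Helpers: elementwise Z = X + Y and Z = X - Y on contiguous n*n matrices. */
static void madd(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
}
static void msub(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
}

/* Strassen multiply C = A*B for n a power of two (row-major, contiguous).
 * Below a cutoff it falls back to a naive triple loop. */
static void strassen(int n, const float *A, const float *B, float *C) {
    if (n <= 64) {
        memset(C, 0, sizeof(float) * (size_t)n * n);
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i*n+j] += A[i*n+k] * B[k*n+j];
        return;
    }
    int h = n / 2;
    size_t q = (size_t)h * h;
    /* 8 quadrants + 7 products + 2 temporaries */
    float *buf = malloc(17 * q * sizeof(float));
    float *A11 = buf,       *A12 = buf + q,   *A21 = buf + 2*q, *A22 = buf + 3*q;
    float *B11 = buf + 4*q, *B12 = buf + 5*q, *B21 = buf + 6*q, *B22 = buf + 7*q;
    float *M[7];
    for (int i = 0; i < 7; i++) M[i] = buf + (8 + i) * q;
    float *T1 = buf + 15*q, *T2 = buf + 16*q;
    /* split A and B into quadrants */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            A11[i*h+j] = A[i*n+j];       A12[i*h+j] = A[i*n+j+h];
            A21[i*h+j] = A[(i+h)*n+j];   A22[i*h+j] = A[(i+h)*n+j+h];
            B11[i*h+j] = B[i*n+j];       B12[i*h+j] = B[i*n+j+h];
            B21[i*h+j] = B[(i+h)*n+j];   B22[i*h+j] = B[(i+h)*n+j+h];
        }
    /* 7 recursive products instead of 8 */
    madd(h, A11, A22, T1); madd(h, B11, B22, T2); strassen(h, T1, T2, M[0]);
    madd(h, A21, A22, T1);                        strassen(h, T1, B11, M[1]);
    msub(h, B12, B22, T2);                        strassen(h, A11, T2, M[2]);
    msub(h, B21, B11, T2);                        strassen(h, A22, T2, M[3]);
    madd(h, A11, A12, T1);                        strassen(h, T1, B22, M[4]);
    msub(h, A21, A11, T1); madd(h, B11, B12, T2); strassen(h, T1, T2, M[5]);
    msub(h, A12, A22, T1); madd(h, B21, B22, T2); strassen(h, T1, T2, M[6]);
    /* recombine products into the quadrants of C */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            int s = i*h + j;
            C[i*n+j]       = M[0][s] + M[3][s] - M[4][s] + M[6][s];
            C[i*n+j+h]     = M[2][s] + M[4][s];
            C[(i+h)*n+j]   = M[1][s] + M[3][s];
            C[(i+h)*n+j+h] = M[0][s] - M[1][s] + M[2][s] + M[5][s];
        }
    free(buf);
}
```

In practice the savings only show up for fairly large n, and the additions churn extra memory bandwidth, which is part of why tuned O(n³) kernels usually win.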
Remnant44 · 10 months ago
I honestly didn't realize how performant the decade-old 2013 Haswell architecture is on vector workloads.

250 GFLOP/core is no joke. He also cross-compared against an M1 Pro, which, when not using the secret matrix coprocessor, achieves effectively the same vector throughput a decade later...
kiririn · 10 months ago
The i7-6700 is Skylake, not Haswell.