
GEMM: From Pure C to SSE Optimized Micro Kernels (2014)

101 points by felixr about 7 years ago

3 comments

stochastic_monk about 7 years ago
This is a great master class, if you will, on iterative optimization for core loops. They progressively improve their method until they approach the performance of standard, optimized BLAS implementations. (About as fast as BLIS, slower than Intel's MKL, which is about as fast as OpenBLAS. [0]) Timing Eigen without linking against BLAS is a little misleading, since Eigen is meant to be linked against a system BLAS.

You wouldn't want to use this code, but it shows you the sorts of things to start paying attention to in these performance-critical sections. I was most surprised by the fact that reordering operations to spread the same instructions apart made a significant difference.

(As an aside, your best bet for practical tools is a metaprogramming library [Blaze seems to be the best] wrapping core operations in a fast BLAS implementation. I personally choose to use Blaze on top of OpenBLAS.)

[0] https://news.ycombinator.com/item?id=10114830
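To make the instruction-spreading observation concrete, here is a minimal sketch (not the article's micro-kernel) of the same idea applied to a dot-product loop. A single accumulator chains every add behind the previous one; splitting the sum across independent accumulators lets the CPU overlap the multiply-add latencies:

```c
#include <stddef.h>

/* Naive: each iteration's add depends on the previous sum,
 * so the loop is serialized on floating-point add latency. */
double dot_chained(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* Interleaved: four independent dependency chains, combined at
 * the end. Assumes n is a multiple of 4 to keep the sketch short. */
double dot_interleaved(const double *x, const double *y, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i + 0] * y[i + 0];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Under strict IEEE semantics the compiler cannot reassociate the single-accumulator loop on its own, which is part of why spreading the instructions by hand shows up as a measurable win.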
mratsim about 7 years ago
I've implemented the whole course in Nim in my tensor library: https://github.com/mratsim/Arraymancer/tree/master/src/tensor/fallback

All BLAS libraries currently only care about float32 and float64. I wanted very fast routines for integer matrix multiplication, so I use this as a fallback for integers while using OpenBLAS/MKL/CLBlast (OpenCL)/CuBLAS (CUDA) for floats.

Thanks to this I achieved 10x the speed of Julia and 22x the speed of Numpy on a 1500x1500 int64 matrix multiplication on CPU: https://github.com/mratsim/Arraymancer#micro-benchmark-int64-matrix-multiplication
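For readers unfamiliar with what such an integer fallback looks like, here is a hypothetical sketch in C (the Arraymancer code itself is Nim, and the tile size here is an arbitrary assumption): a cache-blocked int64 GEMM computing C += A * B for row-major matrices, the kind of routine you need precisely because BLAS does not cover integers:

```c
#include <stdint.h>
#include <stddef.h>

enum { BLOCK = 64 };  /* assumed tile size; tune for your cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (m x n) += A (m x k) * B (k x n), all row-major int64_t.
 * Tiling over all three dimensions keeps the working set in cache;
 * the i-p-j order inside a tile makes the B and C accesses
 * sequential, which auto-vectorizes well for integer lanes. */
void gemm_i64(size_t m, size_t n, size_t k,
              const int64_t *A, const int64_t *B, int64_t *C) {
    for (size_t ii = 0; ii < m; ii += BLOCK)
        for (size_t pp = 0; pp < k; pp += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                for (size_t i = ii; i < min_sz(ii + BLOCK, m); ++i)
                    for (size_t p = pp; p < min_sz(pp + BLOCK, k); ++p) {
                        const int64_t a = A[i * k + p];
                        for (size_t j = jj; j < min_sz(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[p * n + j];
                    }
}
```

This is only the blocking skeleton; a production kernel would add the packing and register-tiled micro-kernel layers the course walks through.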
mishurov about 7 years ago
I appreciate the work done at the hardware level for streaming computations of structured data. But from a high-level view, it's more critical to be able to compute matrix decompositions efficiently: they help solve systems of linear equations and, obviously, discretised PDEs.
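As a minimal illustration of the decomposition-first workflow (a sketch, not tied to any particular library; it omits pivoting for brevity and so assumes nonzero pivots): factor A once into LU, then solve A x = b with two cheap triangular substitutions per right-hand side instead of redoing elimination each time:

```c
#include <stddef.h>

/* In-place Doolittle LU of an n x n row-major matrix A.
 * Afterwards, the strict lower triangle holds L (unit diagonal
 * implied) and the upper triangle holds U, sharing A's storage. */
void lu_decompose(double *A, size_t n) {
    for (size_t k = 0; k < n; ++k)
        for (size_t i = k + 1; i < n; ++i) {
            double l = A[i * n + k] / A[k * n + k];
            A[i * n + k] = l;
            for (size_t j = k + 1; j < n; ++j)
                A[i * n + j] -= l * A[k * n + j];
        }
}

/* Solve L y = b (forward), then U x = y (backward); x overwrites b. */
void lu_solve(const double *A, size_t n, double *b) {
    for (size_t i = 1; i < n; ++i)
        for (size_t j = 0; j < i; ++j)
            b[i] -= A[i * n + j] * b[j];
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; ++j)
            b[i] -= A[i * n + j] * b[j];
        b[i] /= A[i * n + i];
    }
}
```

The two points connect, of course: the inner loops of a factorization like this are themselves GEMM-shaped, which is why fast decompositions in LAPACK are built on top of fast GEMM.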