科技回声
Squeezing performance out of CUDA

17 points by g-garron, about 13 years ago

1 comment

pavanky, about 13 years ago
Here are a couple of observations:

1) The implementation may not be the most efficient for a larger or denser matrix. 1% of 52000 is 520 nonzeros per row; divided across 32 threads, that is about 16 additions per thread. As that number grows, increasing the number of threads (and eventually using more blocks per row) would be a good idea.

2) He is allocating twice as much shared memory as required. I genuinely hope that is a leftover from an earlier version. If not, it is a killer for performance: using more shared memory per block reduces the number of blocks that can run concurrently.

Note: Not sure why he is still using CUDA 3.2. CUDA has had CSR multiplication for a few months now, and it has even gone through a revision to make it faster.
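
Both observations come down to simple arithmetic, which can be sketched in a few lines of Python. This is a back-of-the-envelope model, not the article's code. The matrix width, density, and 32 threads per row are the figures quoted in the comment. The shared-memory sizes (48 KB per multiprocessor, 4 KB needed per block) and the function names are assumptions picked purely for illustration.

```python
def additions_per_thread(columns, density_percent, threads_per_row):
    """Average number of nonzeros each thread accumulates for one row."""
    nonzeros_per_row = columns * density_percent / 100
    return nonzeros_per_row / threads_per_row


def blocks_limited_by_shared_memory(smem_per_sm, smem_per_block):
    """Upper bound on concurrent blocks per multiprocessor,
    considering shared memory only (registers and thread limits ignored)."""
    return smem_per_sm // smem_per_block


# Figures quoted in the comment: 52000 columns, 1% dense, 32 threads per row.
work = additions_per_thread(52000, 1, 32)
print(f"{work:.2f} additions per thread")

# Assumed, illustrative sizes in bytes; not taken from the article.
SMEM_PER_SM = 48 * 1024
NEEDED_PER_BLOCK = 4 * 1024
DOUBLED_PER_BLOCK = 2 * NEEDED_PER_BLOCK

print(blocks_limited_by_shared_memory(SMEM_PER_SM, NEEDED_PER_BLOCK))
print(blocks_limited_by_shared_memory(SMEM_PER_SM, DOUBLED_PER_BLOCK))
```

Under this model, doubling the shared memory requested per block halves the shared-memory bound on concurrent blocks, whatever the assumed sizes are. Real occupancy also depends on register use and per-multiprocessor thread limits, which the sketch leaves out.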