Faster sorting with SIMD CUDA intrinsics (2024)

92 points by winwang, 7 days ago
Code at https://github.com/wiwa/blog-code/

3 comments

ashvardanian, 6 days ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__SIMD.html). That significantly shrinks their applicability in most domains outside of video and string processing. I've had pretty high hopes for the DPX instructions on Hopper (https://developer.nvidia.com/blog/boosting-dynamic-programming-performance-using-nvidia-hopper-gpu-dpx-instructions) and started integrating them in StringZilla last year, but the gains aren't huge.
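To make the "32-bit data pack" idea concrete, here is a minimal host-side sketch in plain C++ of what a packed 4x 8-bit wrap-around add computes, which is the kind of operation CUDA exposes through intrinsics such as `__vadd4`. This is an emulation for illustration only, not NVIDIA's implementation, and the function name `packed_add_u8x4` is hypothetical.

```cpp
#include <cstdint>

// Emulates a per-byte (4x 8-bit) wrap-around addition on one 32-bit word.
// Hypothetical helper for illustration; on the GPU a single SIMD intrinsic
// would do this in one instruction.
inline std::uint32_t packed_add_u8x4(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t low7 = 0x7F7F7F7Fu;  // low 7 bits of every byte lane
    const std::uint32_t high = 0x80808080u;  // top bit of every byte lane

    // Add the low 7 bits of each lane; a carry can reach bit 7 of its own
    // lane but can never spill into the neighbouring lane.
    std::uint32_t sum = (a & low7) + (b & low7);

    // Fold in the top bits with XOR, which discards the carry out of each
    // lane and gives modulo-256 behaviour per byte.
    return sum ^ ((a ^ b) & high);
}
```

For example, packing the bytes 0x01, 0xFF, 0x7F, 0x80 into one word and adding the packed bytes 0x01, 0x01, 0x80, 0x80 yields the lanes 0x02, 0x00, 0xFF, 0x00: each lane wraps independently. The lanes being only 8 or 16 bits wide is what limits these intrinsics mostly to byte-oriented workloads such as video and string processing.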
DennisL123, 6 days ago
Interesting stuff. Not sure if I read this right that it's 16- and 32-bit integer values that get sorted. If yes, I'd love to see if the GPU implementation can beat a competitive radix sort implementation on a CPU.
fourseventy, 6 days ago
What are the biggest use cases of GPU-accelerated sorting?