科技回声

ashvardanian 6 days ago

The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.

Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__SIMD.html). That significantly shrinks their applicability in most domains outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (https://developer.nvidia.com/blog/boosting-dynamic-programming-performance-using-nvidia-hopper-gpu-dpx-instructions) and started integrating them in StringZilla last year, but the gains aren't huge.
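To make the "32-bit pack of 4x 8-bit values" idea concrete, here is a minimal host-side C++ sketch of what a per-byte wrap-around add computes, which is the behavior the CUDA Math API documents for `__vadd4`. The helper names `packed_add_u8x4` and `packed_add_u8x4_reference` are hypothetical and only for illustration. This is a CPU emulation of the lane semantics, not the device intrinsic itself, and it says nothing about performance.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: adds four unsigned 8-bit lanes packed into one
// 32-bit word, with wrap-around per lane and no carry between lanes.
constexpr uint32_t packed_add_u8x4(uint32_t a, uint32_t b) {
    constexpr uint32_t low7 = 0x7F7F7F7Fu;  // low 7 bits of every lane
    constexpr uint32_t top1 = 0x80808080u;  // top bit of every lane
    // Adding only the low 7 bits cannot carry out of a lane.
    uint32_t sum = (a & low7) + (b & low7);
    // The top bit of each lane is then a XOR b XOR carry-in; the carry-in
    // is already sitting in sum, so XOR in the top bits of a and b.
    return sum ^ ((a ^ b) & top1);
}

// Hypothetical reference: the same operation, one lane at a time.
inline uint32_t packed_add_u8x4_reference(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return result;
}

// Lanes from high to low: 0x01+0x01, 0xFF+0x01 (wraps), 0x7F+0x01, 0x10+0x10.
static_assert(packed_add_u8x4(0x01FF7F10u, 0x01010110u) == 0x02008020u,
              "each 8-bit lane wraps independently");
```

On a GPU the whole pack is handled by a single intrinsic call per thread, which is why these intrinsics mostly help when the data is naturally 8 or 16 bits wide, as in pixel or character processing.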

DennisL123 6 days ago

Interesting stuff. Not sure if I read this right, but it's 16- and 32-bit integer values that get sorted? If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.

fourseventy 6 days ago

What are the biggest use cases for GPU-accelerated sorting?

Faster sorting with SIMD CUDA intrinsics (2024)

3 comments
