The article covers extremely important CUDA warp-level synchronization/exchange primitives, but that's not what is generally called SIMD in CUDA land.<p>Most "CUDA SIMD" intrinsics operate on a 32-bit word packing either 2x 16-bit or 4x 8-bit values (<a href="https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH__INTRINSIC__SIMD.html" rel="nofollow">https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...</a>). That significantly limits their applicability outside of video and string processing. I had pretty high hopes for the DPX instructions on Hopper (<a href="https://developer.nvidia.com/blog/boosting-dynamic-programming-performance-using-nvidia-hopper-gpu-dpx-instructions" rel="nofollow">https://developer.nvidia.com/blog/boosting-dynamic-programmi...</a>) and started integrating them into StringZilla last year, but the gains haven't been huge.
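<p>To make the "4x 8-bit values in one 32-bit word" point concrete, here is a rough host-side C++ sketch of what a per-byte add such as <code>__vadd4</code> computes, assuming it does plain wraparound addition in each byte lane. The name <code>emulated_vadd4</code> is made up for illustration and is not part of any CUDA API; on the GPU this is a single intrinsic call, not the bit tricks below.

```cpp
#include <cstdint>

// Hypothetical helper (not a CUDA API): emulates a per-byte wraparound add
// over four 8-bit lanes packed into one 32-bit word.
constexpr std::uint32_t emulated_vadd4(std::uint32_t a, std::uint32_t b) {
    // Add the low 7 bits of every lane. The largest per-lane sum is
    // 0x7F + 0x7F = 0xFE, so no carry can leak into the neighboring lane.
    const std::uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    // Combine the top bit of each lane with XOR, which adds the top bits
    // together with the carry from the low 7 bits and drops any carry-out.
    const std::uint32_t top = (a ^ b) & 0x80808080u;
    return low ^ top;
}

// Lanes, from most to least significant byte:
// 0x01 + 0x01 = 0x02, 0xFF + 0x01 wraps to 0x00,
// 0x7F + 0x01 = 0x80, 0x80 + 0x01 = 0x81.
static_assert(emulated_vadd4(0x01FF7F80u, 0x01010101u) == 0x02008081u,
              "each byte lane wraps independently");
```

<p>The lanes never interact, which is why this style of SIMD suits byte-oriented data like pixels or characters and does little for wider types.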