> Modern CUDA uses explicit programmer-managed masks, which is powerful and takes advantage of their hardware specifics. But mis-using the mask can cause a deadlock, as divergent threads could simply never participate in a subgroup operation that expects them to, leaving the other threads to block forever. I can see why this solution leaves something to be desired, as it just offloads the problem and the risk of misuse to the user.

Note that from Volta (2017, and on consumer GPUs since Turing in 2018) onwards, Independent Thread Scheduling is available, with a separate program counter per SIMT thread.

This makes atomics across different lanes of the same warp possible, providing the forward-progress guarantees assumed by the C++ memory model. Quite a few modern CUDA apps are starting to rely on that, and as such will not work on Pascal or earlier, never mind other GPU vendors.

Cooperative Groups are very flexible in CUDA too.

https://docs.nvidia.com/cuda/volta-tuning-guide/index.html#sm-independent-thread-scheduling

As such, control flow is handled very differently on post-Volta GPUs compared to pre-Volta ones, with pre-Volta more akin to what AMD still does today.
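To illustrate the kind of code that depends on this, here is a minimal sketch of my own (not taken from the linked tuning guide; the `lock` and `counter` names are made up): an intra-warp spin lock where every lane of one warp contends on the same atomic flag. With Independent Thread Scheduling the lane holding the lock keeps making progress while its siblings spin, so the kernel terminates; on pre-Volta hardware the warp's single program counter can keep replaying the spinning lanes and starve the lock holder, hanging the kernel.

    // Hedged sketch: an intra-warp lock that relies on Volta+ Independent
    // Thread Scheduling for forward progress. Expect it to hang on Pascal
    // or earlier. Compile with nvcc; names are illustrative.
    #include <cstdio>

    __device__ int lock = 0;     // 0 = free, 1 = held
    __device__ int counter = 0;  // protected by the lock

    __global__ void per_lane_critical_section() {
        // Every lane of the single warp tries to take the same lock.
        while (atomicCAS(&lock, 0, 1) != 0) {
            // Spin: needs per-lane forward progress (Volta and later).
        }
        __threadfence();          // acquire-side ordering
        counter += 1;             // critical section
        __threadfence();          // make the update visible before release
        atomicExch(&lock, 0);     // release
    }

    int main() {
        per_lane_critical_section<<<1, 32>>>();   // one full warp
        cudaDeviceSynchronize();
        int host_counter = 0;
        cudaMemcpyFromSymbol(&host_counter, counter, sizeof(int));
        printf("counter = %d\n", host_counter);   // expect 32 on Volta+
        return 0;
    }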
The article mentions WebAssembly as having the same issue as SPIR-V with structured control flow, but actually in Wasm it is quite a bit better, because you are allowed to break/continue from an arbitrarily nested block.

This allows you to convert any reducible CFG without losing runtime performance, and only pay a price for irreducible ones (which are somewhat rare).

Shameless plug: I wrote an article about solving the structured control flow problem in WebAssembly -> https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2
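To make that difference concrete, here is a small C++ analogy of my own (not taken from the linked article; the grid-search functions are invented for illustration). Version (A) mirrors what a Wasm br out of nested blocks expresses directly; version (B) shows the flag-and-retest shape a compiler is pushed into when multi-level exits are not available, which is the runtime price being referred to.

    #include <cstdio>

    // (A) One branch exits both nested loops at once, the moral equivalent
    //     of a Wasm `br` that targets an enclosing block (goto stands in).
    int find_in_grid_multibreak(const int (*g)[4], int rows, int needle) {
        int r, c;
        for (r = 0; r < rows; ++r) {
            for (c = 0; c < 4; ++c) {
                if (g[r][c] == needle) {
                    goto found;            // exit both loops in one branch
                }
            }
        }
        return -1;                         // not found
    found:
        return r * 4 + c;
    }

    // (B) The same reducible CFG with only single-level structured exits:
    //     an extra flag plus extra per-iteration tests replace the missing
    //     multi-level branch, adding work on every loop iteration.
    int find_in_grid_flag(const int (*g)[4], int rows, int needle) {
        int result = -1;
        bool found = false;
        for (int r = 0; r < rows && !found; ++r) {
            for (int c = 0; c < 4 && !found; ++c) {
                if (g[r][c] == needle) {
                    result = r * 4 + c;
                    found = true;
                }
            }
        }
        return result;
    }

    int main() {
        const int grid[2][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        printf("%d %d\n", find_in_grid_multibreak(grid, 2, 7),
                          find_in_grid_flag(grid, 2, 7));    // both print 6
        return 0;
    }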
This is a very good introduction to the inherent difficulty that comes from trying to do SIMT execution / whole-program vectorization while at the same time giving programmers the power of certain optimization tricks that punch through the SIMT abstraction and expose the underlying vector architecture (via subgroup/wave operations).

The title is somewhat misleading, as this trouble isn't specific to SPIR-V. It is inherent to the field, and DXIL has the same problem. (Arguably it's worse there because Microsoft tends to be quite bad at properly specifying the semantics of DXIL and DirectX more generally.)
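For readers unfamiliar with those tricks, here is a small CUDA sketch of my own (not from the comment or the article) of the kind of subgroup/wave operation being discussed: a warp-wide sum that deliberately reaches through the SIMT model to the 32-wide vector hardware underneath.

    #include <cstdio>

    // Warp-wide sum via shuffle intrinsics. Assumes all 32 lanes of the
    // warp are active (converged) when called, hence the full 0xffffffff
    // mask; that assumption is exactly what divergence can break.
    __device__ int warp_sum(int value) {
        for (int offset = 16; offset > 0; offset /= 2) {
            value += __shfl_down_sync(0xffffffffu, value, offset);
        }
        return value;                    // lane 0 ends up with the total
    }

    __global__ void demo(int* out) {
        int lane = threadIdx.x;          // one warp: lanes 0..31
        int total = warp_sum(lane + 1);  // 1 + 2 + ... + 32 = 528
        if (lane == 0) *out = total;
    }

    int main() {
        int *d_out, h_out = 0;
        cudaMalloc(&d_out, sizeof(int));
        demo<<<1, 32>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("warp sum = %d\n", h_out);   // expect 528
        cudaFree(d_out);
        return 0;
    }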