A very good and reasonably approachable discussion of the pros and cons of how CUDA programming is actually realized in hardware. The explanation of how GPUs handle context switching is particularly thoughtful and enlightening. It took me a long time to figure this out a couple of months ago; a guide like this would have saved me a few nights.<p>I was surprised that the author didn't once use the term CUDA, though: they even discuss actual syntax from it, but never mention the language (extension) by name.
Very nice article. In my limited experience with OpenCL programming, the most difficult thing is understanding how memory access patterns affect performance. It doesn't help that the best pattern may differ from platform to platform.<p>I wonder if what's needed is a higher-level representation that can compile to the best access pattern for the given hardware. (And something that can try several access patterns for your problem and choose the most efficient one.) GPU programming is still quite new, so I guess such a tool is bound to show up eventually.<p>Even if it couldn't handle <i>all</i> possible situations, such a tool would still be useful; you'd just drop down to the CUDA/OpenCL level for the problems that are too difficult to express declaratively.
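The "try several access patterns and pick the fastest" idea is essentially empirical autotuning, which libraries like ATLAS already do for CPU BLAS kernels. A toy sketch of the idea, in Python for brevity rather than OpenCL (all function names here are made up for illustration): two candidate traversal orders over the same flat array are benchmarked, and the faster one is selected at runtime.

```python
import time

def time_it(fn, *args, repeats=3):
    """Return the best wall-clock time for fn(*args) over a few runs."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best

# Two access patterns for summing an n-by-n array stored flat, row-major:
def sum_row_major(a, n):
    # Contiguous walk: consecutive iterations touch adjacent memory.
    s = 0
    for i in range(n):
        base = i * n
        for j in range(n):
            s += a[base + j]
    return s

def sum_col_major(a, n):
    # Strided walk: consecutive iterations jump n elements apart.
    s = 0
    for j in range(n):
        for i in range(n):
            s += a[i * n + j]
    return s

def pick_fastest(candidates, *args):
    """Tiny 'autotuner': benchmark each candidate and return the winner."""
    timed = [(time_it(fn, *args), fn) for fn in candidates]
    timed.sort(key=lambda pair: pair[0])
    return timed[0][1]

n = 200
a = list(range(n * n))
best = pick_fastest([sum_row_major, sum_col_major], a, n)
```

A real GPU autotuner would benchmark whole kernel variants (coalesced vs. strided global loads, different work-group sizes) on the target device, since the winner genuinely differs across hardware; the selection logic is the same idea scaled up.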