> This is in part because of the work by Google on the NVPTX LLVM back-end.

I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to answer questions about it.

As background, Nvidia's CUDA ("CUDA C++?") compiler, nvcc, uses a fork of LLVM as its backend. Clang can also compile CUDA code, using regular upstream LLVM as its backend. The relevant backend in LLVM was originally contributed by Nvidia, but these days the team I'm on at Google is the main contributor.

I don't know much (okay, anything) about Julia except what I read in this blog post, but the dynamic specialization looks a lot like XLA, a JIT backend for TensorFlow that I work on. So that's cool; I'm happy to see this work.

> Full debug information is not supported by the LLVM NVPTX back-end yet, so cuda-gdb will not work yet.

We'd love help with this. :)

> Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA PTX compiler. [0]

We ran into what appears to be the same issue [2] about a year and a half ago. Nvidia is well aware of the issue, but I don't expect a fix except by upgrading to Volta hardware.

[0] https://julialang.org/blog/2017/03/cudanative
[1] https://github.com/JuliaGPU/CUDAnative.jl/issues/4
[2] https://bugs.llvm.org/show_bug.cgi?id=27738
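If anyone wants to play with the clang path, here's roughly what it looks like end to end. This is only a sketch: the kernel and file name are illustrative, and the --cuda-gpu-arch value and library paths depend on your GPU and CUDA install (the invocation follows LLVM's "Compiling CUDA with clang" docs).

    // axpy.cu -- a minimal illustrative kernel, not from the blog post.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void axpy(float a, const float* x, float* y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
      const int n = 1024;
      float *x, *y;
      cudaMallocManaged(&x, n * sizeof(float));
      cudaMallocManaged(&y, n * sizeof(float));
      for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

      // One thread per element, 256 threads per block.
      axpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
      cudaDeviceSynchronize();

      printf("y[0] = %f\n", y[0]);  // expect 5.0
      cudaFree(x);
      cudaFree(y);
      return 0;
    }

    // Build with clang instead of nvcc (paths and arch are assumptions):
    //   clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_60 \
    //     -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread

The device side goes through the NVPTX backend to PTX, and ptxas then turns that into SASS, so from that point on the pipeline looks much like nvcc's.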
In my experience, CUDA / OpenCL are actually rather easy to use.

The hard part is optimization, because the GPU architecture (SIMD / SIMT) is so alien compared to normal CPUs.

Here's a step-by-step example of one guy optimizing a matrix multiplication kernel in OpenCL (specifically for NVidia GPUs): https://cnugteren.github.io/tutorial/pages/page1.html

Just like how high-performance CPU computing requires a deep understanding of the cache hierarchy, high-performance GPU computing requires a deep understanding of the various memory spaces on the GPU.

------------

Now granted: deep optimization of routines on CPUs is similarly challenging, and actually follows a very similar process of partitioning your problem into L1-sized blocks. But high-performance GPU code not only has to consider the L1 cache, but also "shared" (OpenCL __local) memory and "register" (OpenCL __private) memory as well. Furthermore, GPUs in my experience have far less memory per thread/shader than CPUs. For example: an Intel "Sandy Bridge" CPU has 64 KB of L1 cache per core (32 KB data + 32 KB instruction), shared by at most 2 threads with hyperthreading enabled. A "Pascal" GPU has 64 KB of "shared" memory per SM, which is extremely fast, like L1 cache. But that 64 KB is shared between 64 FP32 cores!

Furthermore, not all algorithms run faster on accelerators either. For example:

https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf

This paper found that their many-core implementation (on a Xeon Phi, not a GPU, but a similarly parallel architecture) was slower than the CPU implementation! Apparently, the game of "Hex" is hard to parallelize / vectorize.

---------------

Now don't get me wrong, this is all very cool stuff. Making various programming tasks easier is always welcome. Just be aware that GPUs are no silver bullet for performance. It takes a lot of work to get high-performance code, regardless of your platform.

And sometimes, CPUs are faster.
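To make the memory-space point concrete, here's a rough CUDA sketch of the first "shared memory tiling" step from that kind of matmul tutorial (CUDA rather than OpenCL; the tile size and names are my own choices, and a seriously tuned kernel would go much further with register blocking and wider loads):

    // Naive version: every operand is re-read from global memory O(N) times.
    __global__ void matmul_naive(const float* A, const float* B, float* C, int N) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= N || col >= N) return;
      float acc = 0.0f;
      for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];
      C[row * N + col] = acc;
    }

    // Tiled version: stage TILE x TILE blocks of A and B in __shared__ memory
    // (OpenCL __local), so each global value is loaded once per tile instead
    // of once per output element. TILE=16 is an assumption, not a tuned choice.
    #define TILE 16
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;
      for (int t = 0; t < N; t += TILE) {
        // Cooperative load of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = (row < N && t + threadIdx.x < N)
                                           ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < N && col < N)
                                           ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
          acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
      }
      if (row < N && col < N) C[row * N + col] = acc;
    }

You'd launch the tiled version with TILE x TILE thread blocks, e.g. matmul_tiled<<<dim3((N+TILE-1)/TILE, (N+TILE-1)/TILE), dim3(TILE, TILE)>>>(A, B, C, N). The later steps in the linked tutorial (more work per thread, register blocking, wider loads) are exactly the "register / __private" level described above.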
> Julia has recently gained support for syntactic loop fusion, where chained vector operations are fused into a single broadcast

Wow. That's very impressive.

I hope one day we get this sort of tooling with AMD GPUs.
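To give a rough idea of why that matters on a GPU (sketched in CUDA since I don't write Julia; the lowering is an assumption, not what CUDAnative actually emits): without fusion, a chain like y .= a .* x .+ b would mean one kernel launch and one full pass over memory per operation, while the fused broadcast can become a single elementwise kernel.

    // Unfused: two kernels, two round trips through global memory,
    // plus a temporary array for the intermediate result.
    __global__ void scale(const float* x, float* tmp, float a, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) tmp[i] = a * x[i];
    }
    __global__ void add_scalar(const float* tmp, float* y, float b, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = tmp[i] + b;
    }

    // Fused: one kernel, one pass, no temporary -- roughly what a fused
    // broadcast buys you.
    __global__ void fused_axpb(const float* x, float* y, float a, float b, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = a * x[i] + b;
    }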