> GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort - and makes Flux GPU support work efficiently even for user-defined functions. That this works out of the box without coordination between GPUArrays + Flux is a pretty unique property of Julia.

Every time I read about Julia, I'm amazed. What a game-changing tool.
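To make that concrete, here's a minimal sketch of differentiating a user-defined function on the GPU. I'm assuming the current package names (CUDA.jl and Zygote; the article's era used CuArrays.jl), and the function itself is just an illustrative choice:

    using CUDA, Zygote

    # A user-defined activation; nothing GPU-specific about it.
    myact(x) = x * tanh(log1p(exp(x)))

    loss(x) = sum(myact.(x))

    x = CUDA.rand(Float32, 1024)      # data lives on the GPU
    g = Zygote.gradient(loss, x)[1]   # the pullback also runs on the GPU; g is a CuArray

No one had to teach the AD library about the GPU or vice versa - the generated backward code is just more Julia that broadcasts over GPU arrays.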
GPGPU (general-purpose GPU) programming is pretty cool. I wrote a utility to let you do it in JavaScript, in the browser, a while back: https://github.com/amoffat/gpgpu.js

The thing to note about GPU programming is that the vast majority of overhead comes from data transfer. If your data set and results are very large, it can be net faster to do the computation on the CPU, even if the GPU performs each calculation faster on average thanks to parallelism. To illustrate, look at the benchmarks on gpgpu.js running a simple kernel:

    CPU:                 6851.25ms
    GPU Total:           1449.29ms
    GPU Execution:         30.64ms
    GPU IO:              1418.65ms
    Theoretical Speedup:  223.59x
    Actual Speedup:         4.73x
The theoretical speedup excludes data transfer, while the actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back-and-forth IO), the bigger your net speed gains are.
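In Julia terms the same idea looks roughly like this (a sketch with CUDA.jl; the function names and the toy kernel are mine):

    using CUDA

    # Naive: pay host<->device transfer on every step.
    function roundtrip_each_step(x, iters)
        for _ in 1:iters
            d = CuArray(x)          # host -> device copy (the expensive part)
            d .= sqrt.(d) .+ 1f0    # the kernel itself is cheap
            x = Array(d)            # device -> host copy
        end
        return x
    end

    # Better: transfer once, run all the kernels, transfer back once.
    function stay_on_device(x, iters)
        d = CuArray(x)
        for _ in 1:iters
            d .= sqrt.(d) .+ 1f0
        end
        return Array(d)
    end

The second version amortizes the two copies over all the iterations, which is where the "actual" speedup starts approaching the "theoretical" one.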
I'm a bit surprised to see that GPU Mandelbrot is at best only 75x faster than the (sequential?) CPU version. Does Julia just generate really fast (multicore/vectorized?) CPU code? Does it also count communication costs? Fractal computations like that are extremely GPU friendly because they involve no memory accesses at all, except for writing the final result. I would expect at least two orders of magnitude improvement over a straightforwardly written C implementation.
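I haven't checked the benchmark source, but the kind of kernel it presumably runs looks roughly like this (a sketch with CUDA.jl; the actual benchmark code may differ). The same scalar function is broadcast over a CPU array and a GPU array:

    using CUDA

    function mandel(c, maxiter)
        z = zero(c)
        for i in 1:maxiter
            abs2(z) > 4f0 && return i   # escaped: report the iteration count
            z = z * z + c
        end
        return maxiter
    end

    xs = range(-2f0, 1f0; length = 1000)
    ys = range(-1.5f0, 1.5f0; length = 1000)
    cs = [Complex{Float32}(x, y) for y in ys, x in xs]

    iters_cpu = mandel.(cs, 255)                  # plain CPU broadcast
    iters_gpu = Array(mandel.(CuArray(cs), 255))  # identical code, GPU broadcast

If the timing includes the final device-to-host copy (the Array call above), that alone could eat a fair chunk of the theoretical advantage.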
While having a Torch-esque GPU ndarray is great, the ability to easily write your own kernels without having to compile gnarly C++ code is what sets Julia apart from competitors IMO. Not sure if there's any other dynamic language offering anything like this.
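For example, a hand-written kernel is just ordinary Julia launched with CUDA.jl's @cuda macro (a sketch; saxpy is only an illustrative choice):

    using CUDA

    # A hand-written kernel: plain Julia, no C++ toolchain involved.
    function saxpy_kernel!(y, a, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] = a * x[i] + y[i]
        end
        return nothing
    end

    n = 2^20
    x = CUDA.fill(1f0, n)
    y = CUDA.fill(2f0, n)

    threads = 256
    blocks = cld(n, threads)
    @cuda threads=threads blocks=blocks saxpy_kernel!(y, 3f0, x)

The kernel is compiled through the same Julia compiler, so you can use your own types and functions inside it instead of dropping down to CUDA C.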
It seems kinda weird to tout how great it is that we have CuArrays and CLArrays when CLArrays hasn't been updated for 1.0 and only claims experimental support for 0.6.

Really hoping we see some movement on CLArrays in the near future.
If anyone wants to try out a free GPU using Google Colab/Jupyter (a K80; you might run into RAM allocation issues if you're not one of the lucky users who get the full amount), here's a quick guide to get a Julia kernel up and running: https://discourse.julialang.org/t/julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/15319
It doesn't really describe the fundamental difference between a GPU and a 4000-core CPU, which is that GPU threads execute in groups (warps/wavefronts) that share a program counter: all the lanes in a group must execute the same instruction each cycle, so divergent branches get serialized.
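A small sketch of why that matters for the code you write (CUDA.jl syntax; the kernel is purely illustrative) - a branch whose two sides are taken by neighbouring threads forces the warp to run both sides:

    using CUDA

    function divergent_kernel!(y)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            # Neighbouring threads take different branches here, so the warp
            # executes both paths with half of its lanes masked off each time.
            if iseven(i)
                @inbounds y[i] = 1f0
            else
                @inbounds y[i] = -1f0
            end
        end
        return nothing
    end

    y = CUDA.zeros(Float32, 1024)
    @cuda threads=256 blocks=4 divergent_kernel!(y)

On a 4000-core CPU each core would just take its own branch independently; on the GPU the divergent work is serialized within each warp.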