I took a GPU programming course in college and did a year-long thesis implementing an RK4 integrator to solve a particular differential equation.<p>In my thesis work, the main issue I encountered was that RK4 is a vector operation, but GPUs are matrix processors. The bottleneck in the application was memory bandwidth, not the GPU's compute throughput. We ended up with a speedup of 16x relative to a single-core CPU implementation of the same problem.<p>The article claims a speedup of 35-60x, but I see they also compared the GPU to a single-core CPU implementation. This is not a fair comparison. To be fair, they need to use the full capabilities of a CPU (think performance per socket, not performance per core). I think Intel makes 18-core CPUs now; with a properly implemented multi-threaded RK4 (not very difficult), I'd expect the speedup to be closer to 2-12x instead of 35-60x.
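<p>For anyone who hasn't seen RK4, here is a minimal sketch of the classical fourth-order step in plain Python. It is not the thesis code or the article's code; the names `rk4_step`, `integrate`, and `decay` are made up for illustration, and the test equation y' = -y is just a convenient assumption. Each step is only four evaluations of the right-hand side plus a weighted sum, so there is little arithmetic per value loaded, which is consistent with memory bandwidth being the limit.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one step of size h with classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    # Weighted average of the four slope estimates.
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(f, t0, y0, t_end, n_steps):
    """Take n_steps fixed-size RK4 steps from t0 to t_end."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for i in range(n_steps):
        y = rk4_step(f, t, y, h)
        t = t0 + (i + 1) * h  # recompute t to avoid accumulating rounding error
    return y


def decay(t, y):
    """Hypothetical test problem: y' = -y, exact solution y(t) = y0 * exp(-t)."""
    return -y


if __name__ == "__main__":
    approx = integrate(decay, 0.0, 1.0, 1.0, 100)
    print(approx, math.exp(-1.0))
```

If you have many independent initial conditions, the outer loop over them is where you would add threads on a CPU, since each trajectory can be integrated without touching the others.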