I'm skeptical of the 8x speedup for several reasons, the main one being that this particular problem does not fit the class of problems that work well on the GPU: the GPU cache is not used at all, and there are many branches. You need to be able to use the GPU's cache in your application, otherwise your performance will almost certainly be memory-bound. The reason you want to avoid branches is that there is only one control unit per group of cores on the GPU, so when threads in the same group take different branches, the branches are executed one after another and the threads on one path sit idle until the others finish. Generally the only code that maps well to the GPU is code that consists of large for loops and has good spatial locality (e.g. matrix multiplication).

The author is comparing a GPU to a CPU, yet the CPU is only running a single thread (presumably; the author did not provide the CPU code used in the comparison). For a fair comparison the full capability of the CPU should be used, by means of a multithreaded application (and, as someone else has already mentioned, vector instructions such as SSE). Think performance per socket, not performance per thread.
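
To make the "performance per socket" point concrete, here is a minimal sketch of what a fairer CPU baseline could look like. The author's CPU code isn't available, so everything here is an assumption for illustration: `work` is a hypothetical stand-in for the per-element computation, and `sum_serial` / `sum_parallel` are names I made up. It only shows the threading part; vectorization (SSE etc.) would come on top of this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-element workload; stands in for whatever the benchmark computes.
inline double work(double x) { return x * x + 1.0; }

// Single-threaded baseline: what the comparison appears to have used.
double sum_serial(const std::vector<double>& data) {
    double total = 0.0;
    for (double x : data) total += work(x);
    return total;
}

// Same computation, split into contiguous chunks with one thread per chunk.
double sum_parallel(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> threads;
    // Round up so every element is covered even when the size is not divisible.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, &partial, t, begin, end] {
            double local = 0.0;  // accumulate locally, write the shared slot once
            for (std::size_t i = begin; i < end; ++i) local += work(data[i]);
            partial[t] = local;
        });
    }
    for (auto& th : threads) th.join();
    // Combine the per-thread partial sums on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

You would time `sum_parallel(data, n)` with `n` set to the number of hardware threads (`std::thread::hardware_concurrency()` gives a hint, though it may return 0) and compare the GPU against that number rather than against `sum_serial`. On some toolchains you need to build with `-pthread`.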