I'd be interested to see what each looks like if the allocations were done outside the benchmark. For that matter, it'd be interesting to see whether, once the allocations are factored out, the same function could be used for both CUDA and CPU. From there, I'd be curious whether the compiler is able to vectorize it automatically, or whether it'd benefit from @simd.

It's also great to see how well CUDA is supported in Julia. I've started picking up Julia lately, and find it incredibly pleasant to work with. It feels like a lovely mix of Haskell, Lisp, and Python, with a really nice REPL.
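For the "same function for CUDA & CPU" idea, a minimal sketch of what I mean (the `saxpy!` kernel here is just an illustrative stand-in, not the function from the post): broadcasting dispatches on the array type, so the same method body runs as a CPU loop on `Array` and as a fused kernel on `CuArray`. The `@simd` variant shows the explicit-loop alternative for the CPU path.

```julia
# Device-agnostic version: broadcasting picks the backend from the array type.
function saxpy!(y, a, x)
    @. y = a * x + y   # fused elementwise update; works for Array and CuArray
    return y
end

# Explicit-loop CPU version, hinting vectorization with @simd.
function saxpy_loop!(y, a, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = a * x[i] + y[i]
    end
    return y
end

x = rand(Float32, 1024)
y = zeros(Float32, 1024)
saxpy!(y, 2f0, x)        # CPU Array

# With CUDA.jl loaded, the identical function handles GPU arrays:
# using CUDA
# xd, yd = CuArray(x), CuArray(y)
# saxpy!(yd, 2f0, xd)    # same code, now launching a CUDA kernel
```

Whether the broadcast loop auto-vectorizes without `@simd` presumably depends on the element type and whether LLVM can prove no aliasing, which is exactly what I'd want the benchmark to tease apart.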