Title appears incorrect; from what I can see, the article never claims that this work is faster than cuBLAS. It does claim some speedups relative to clBLAS [note: "cl", not "cu"], and it references other work that claims speedups over cuBLAS.