A strong GPU programmer can do this instinctively. But thinking about it more deeply, I think this is exactly the kind of project that demonstrates what is so different about GPU vs. CPU programming.

In CPU programming, sequentially emitting printf output one call at a time just works, and you can keep optimizing it to go faster and faster.

In GPU programming, it's inherently a parallel job. Any printf call will cause all 32 threads of your warp to call printf at the same time (at a minimum!), which immediately requires all 32 threads to coordinate. Bonus points for expanding out to the 1024 threads of a workgroup, or to cross-workgroup atomics across a wider CUDA grid.

This "printf" seems simple enough to a typical programmer, and yet it hits so many advanced parallelization problems, such as "reserving" buffer space and other tricks needed to turn the parallel output back into something sequential for CPU land. See the sketch below.
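Here's a minimal sketch of that reservation trick in CUDA, assuming a fixed device-side byte buffer and a global write cursor (all names here are mine, not from any particular library):

    __device__ char         g_log_buf[1 << 20];  // device-side output buffer
    __device__ unsigned int g_log_cursor = 0;    // next free byte in the buffer

    // Each thread atomically claims its own contiguous slice before
    // writing, so 32+ simultaneous callers never interleave bytes.
    __device__ void gpu_log(const char* msg, unsigned int len)
    {
        unsigned int off = atomicAdd(&g_log_cursor, len);
        if (off + len > sizeof(g_log_buf)) return;  // buffer full: drop the message
        for (unsigned int i = 0; i < len; ++i)
            g_log_buf[off + i] = msg[i];
    }

After the kernel finishes, the host copies g_log_buf back and prints the first g_log_cursor bytes sequentially. A real implementation would likely aggregate the atomicAdd per warp (lane 0 reserves the warp's combined total, then shuffles offsets back out) to cut contention, which is exactly the kind of coordination the CPU version never has to think about.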