A bit long-winded, but I enjoy seeing more interest in real-time audio on non-traditional compute platforms.<p>I'm under the impression that more recent versions of CUDA can run "streaming" kernels, where you launch the kernel only once and then continually feed it data and process its output. Presumably this would slightly reduce the latency (or latency variability) of kernel launches, and increase throughput if the kernel stays resident during the time it would previously have spent waiting between the end of one task and the start of the next.
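<p>To make the launch-once idea concrete, here is a rough host-side analogy in plain C++, not actual CUDA code. A worker thread stands in for the resident kernel: it is started once, spins on a single-slot mailbox, and processes each audio block as it arrives. The names `Mailbox`, `resident_worker`, and `stream_blocks` are hypothetical, and the "processing" is just a gain multiply. On a GPU the same shape would presumably need host-visible memory and more careful synchronization, so treat this only as a sketch of the control flow.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical single-slot mailbox shared by the producer and the resident worker.
struct Mailbox {
    std::vector<float> block;        // one audio block, processed in place
    std::atomic<int> submitted{0};   // bumped by the producer when a block is ready
    std::atomic<int> completed{0};   // set by the worker when that block is done
    std::atomic<bool> quit{false};   // asks the worker to exit
};

// Stands in for the resident kernel: started once, then polls for work.
void resident_worker(Mailbox& mb, float gain) {
    int seen = 0;
    while (true) {
        // Spin until a new block is published or shutdown is requested.
        while (mb.submitted.load(std::memory_order_acquire) == seen) {
            if (mb.quit.load(std::memory_order_acquire)) return;
            std::this_thread::yield();
        }
        ++seen;
        for (float& sample : mb.block) sample *= gain;  // placeholder processing
        mb.completed.store(seen, std::memory_order_release);
    }
}

// Feeds every block through one worker that is started a single time.
std::vector<std::vector<float>> stream_blocks(std::vector<std::vector<float>> blocks,
                                              float gain) {
    Mailbox mb;
    std::thread worker(resident_worker, std::ref(mb), gain);  // the only "launch"
    int sent = 0;
    for (auto& b : blocks) {
        mb.block = b;                                          // feed input
        mb.submitted.store(++sent, std::memory_order_release);
        while (mb.completed.load(std::memory_order_acquire) != sent) {
            std::this_thread::yield();                         // wait for this block
        }
        b = mb.block;                                          // collect output
    }
    mb.quit.store(true, std::memory_order_release);
    worker.join();
    return blocks;
}
```

<p>Calling `stream_blocks({{1.0f, 2.0f}, {3.0f}}, 0.5f)` pushes two blocks through the same worker without restarting it. The per-block cost is a flag write and a poll rather than a fresh launch, which is the saving the streaming approach is after.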