There is no explanation of how it works. Does it work on top of existing APIs in user space? Or is there a custom kernel driver bypassing user space?<p>I've done some high-throughput streaming from HD/SSD to GPU before, and it's pretty easy to beat the naive solution, but getting the most out of it would require kernel-space code.<p>I was doing random-access streaming of textures: memory-mapped files for input, with background threads on the CPU using memcpy to copy into persistent/coherent mapped pixel buffers. This was designed for random access and to take advantage of the buffer cache (works great when a page is reused). If I had been working on a sequential/full-file upload, my solution would have been entirely different.<p>Edit: here's the source: <a href="https://github.com/kaigai/ssd2gpu" rel="nofollow">https://github.com/kaigai/ssd2gpu</a><p>It has a custom kernel module.
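<p>For illustration, here is a minimal Python sketch of the user-space pattern described above: a memory-mapped input file, with background threads copying randomly chosen pages into a staging buffer. It is not the original code and makes several assumptions: the `bytearray` only stands in for a persistent/coherent mapped pixel buffer (no GPU API is involved), and `PAGE`, `stream_pages`, and the worker count are hypothetical names and values. It also assumes every requested page lies fully inside the file.

```python
import mmap
from concurrent.futures import ThreadPoolExecutor

PAGE = 4096  # assumed page/tile size in bytes (hypothetical value)


def stream_pages(path, page_indices, staging, workers=4):
    """Copy the requested pages of `path` into `staging`, one slot per request.

    `staging` is a writable buffer (here a bytearray) standing in for a
    persistently mapped GPU pixel buffer. Slot i receives page_indices[i].
    """
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as src:
        view = memoryview(src)     # zero-copy view of the mapped file
        dst = memoryview(staging)  # zero-copy view of the staging buffer

        def copy_one(slot_and_page):
            slot, page = slot_and_page
            # Slice assignment between memoryviews is a raw byte copy,
            # the Python analogue of memcpy. Touching the source pages
            # faults them in through the OS page cache.
            dst[slot * PAGE:(slot + 1) * PAGE] = \
                view[page * PAGE:(page + 1) * PAGE]

        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # list() forces completion and re-raises worker exceptions.
                list(pool.map(copy_one, enumerate(page_indices)))
        finally:
            # Release the views so the mmap can be closed cleanly.
            view.release()
            dst.release()
```

<p>Calling `stream_pages("textures.bin", [3, 0, 2], bytearray(3 * PAGE))` would fill three slots with pages 3, 0, and 2 of a hypothetical file. CPython's GIL will likely serialize these copies, so this shows the structure rather than the throughput; a real implementation would do the memcpy in native threads. Pages that are already in the page cache are served without touching the disk, which is why reuse helps.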