With an SPU's 256 KB of local store and explicit DMA, the ideal way to use the SPU was to split the local store into six sections: code, local variables, DMA in, input, output, and DMA out. That way you could have async DMA running in parallel in both directions while you transformed your inputs into your outputs. It also meant your working space was even smaller...

Async DMA is important because the latency of a DMA operation is about 500 cycles! But then you remember that the latency of the CPU missing cache is also about 500 cycles... And gameplay code misses cache like it was a childhood pet. So in theory you could just relax, get it working any way possible, and it would still be a huge win. Some people even implemented pointer wrappers backed by software-managed caches.

500 cycles sounds like a lot. But remember that the PS2 ran at 300 MHz (with a memory latency of about 50 cycles), while the PS3 and 360 both ran at 3.2 GHz (with a memory latency of about 500 cycles). Both systems pushed the clock rate much higher than PCs of the time, but to do so, "niceties" like out-of-order execution were sacrificed. A fixed ping-pong style of hyperthreading should be good enough to cover up half of the stall latency, right?

Unfortunately, in most games the SPUs ended up devoted full time to making up for the weakness of the GPU (pretty much a GeForce 7600 GT). Full-screen post-processing was an obvious target, but the GPU's vertex shaders also needed a lot of CPU work to set them up. Moving that work to the SPUs freed up a lot of time for the gameplay code.
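The six-way split above is really a double-buffered pipeline. Here's a minimal sketch of the idea, simulated on an ordinary CPU: `memcpy` stands in for the async DMA gets and puts (a real SPU version would use `mfc_get`/`mfc_put` with tag waits instead), and all names, sizes, and the `transform` kernel are illustrative assumptions, not from any real codebase. It assumes `total` is a multiple of `CHUNK`.

```c
#include <string.h>

#define CHUNK 64  /* elements per DMA-sized chunk (illustrative) */

/* Stand-in for the real per-chunk kernel. */
static void transform(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

/* Ping-pong pipeline: while chunk c is being transformed, chunk c+1
 * is "in flight" into the other input buffer. On a real SPU the two
 * memcpys would be asynchronous DMA overlapped with transform(). */
void process(const float *src, float *dst, int total) {
    float buf_a[CHUNK], buf_b[CHUNK];   /* DMA-in + input buffers  */
    float out_a[CHUNK], out_b[CHUNK];   /* output + DMA-out buffers */
    float *in_cur = buf_a, *in_next = buf_b;
    float *out_cur = out_a, *out_prev = out_b;
    int nchunks = total / CHUNK;

    memcpy(in_cur, src, CHUNK * sizeof(float));  /* prime the pipe */
    for (int c = 0; c < nchunks; c++) {
        if (c + 1 < nchunks)  /* "DMA get" the next chunk */
            memcpy(in_next, src + (c + 1) * CHUNK, CHUNK * sizeof(float));
        transform(in_cur, out_cur, CHUNK);
        memcpy(dst + c * CHUNK, out_cur, CHUNK * sizeof(float)); /* "DMA put" */
        /* ping-pong the buffers for the next iteration */
        float *t = in_cur;  in_cur = in_next;   in_next = t;
        t = out_cur;        out_cur = out_prev; out_prev = t;
    }
}
```

The buffer swap is just a pointer exchange, which is why the scheme costs local-store space rather than time: each of the six regions is resident simultaneously, shrinking what's left for working data.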
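The "pointer wrapper with a software-managed cache" trick can be sketched as a tiny direct-mapped, read-only cache of DMA-line-sized blocks. This is a hypothetical illustration, again with `memcpy` standing in for the DMA a real SPU version would issue; the type and function names, line size, and line count are all assumptions for the sketch.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 128   /* one "DMA line" (illustrative) */
#define NUM_LINES 64    /* direct-mapped; 8 KB of cached data */

typedef struct {
    uint8_t   data[NUM_LINES][LINE_SIZE];
    uintptr_t tag[NUM_LINES];  /* main-memory line address resident in slot */
} SwCache;

static void sw_cache_init(SwCache *c) {
    for (size_t i = 0; i < NUM_LINES; i++)
        c->tag[i] = UINTPTR_MAX;  /* no line resident */
}

/* Resolve a "main memory" pointer through the cache, faulting the
 * line in on a miss (~500 cycles of DMA latency on a real SPU).
 * Returns a pointer into local cache storage; read-only sketch,
 * so there is no dirty tracking or write-back. */
static void *sw_cache_get(SwCache *c, const void *ea) {
    uintptr_t addr = (uintptr_t)ea;
    uintptr_t line = addr & ~(uintptr_t)(LINE_SIZE - 1);
    size_t    slot = (line / LINE_SIZE) % NUM_LINES;

    if (c->tag[slot] != line) {               /* miss: "DMA" the line in */
        memcpy(c->data[slot], (const void *)line, LINE_SIZE);
        c->tag[slot] = line;
    }
    return c->data[slot] + (addr - line);
}
```

A hit costs a mask, a shift, and a compare; a miss costs one DMA. Given that a PPU cache miss was also ~500 cycles, a wrapper like this with a decent hit rate could already break even, which is the "just get it working" argument in the text.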