My understanding is that they are plenty pipelined, but the GPU works on a more predictable workload, so instruction order is more likely to be rearranged by the compiler than by the silicon. That is, the CPU tries as hard as it can to maximize single-threaded performance on branchy workloads and "wastes" transistors and power on that, while the GPU expects branches and memory accesses to be more predictable and spends the transistors and power it saves on adding more cores.
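
To make the "GPU expects branches to be predictable" part concrete, here is a rough toy model in Python, not a description of any particular chip. It assumes a group of lanes that run in lockstep (32 lanes here, chosen for illustration) and that a divergent if/else forces the group to execute both sides one after the other. The function name `warp_cost` and the cycle counts are made up for this sketch.

```python
def warp_cost(conditions, then_cost, else_cost):
    """Toy cost of one if/else for a group of lanes running in lockstep.

    conditions: one boolean per lane (True means the lane takes the 'then' side).
    then_cost / else_cost: hypothetical cycle counts for each side.
    Assumption: if the lanes disagree, the group pays for both sides in turn.
    """
    cost = 0
    if any(conditions):        # at least one lane needs the 'then' side
        cost += then_cost
    if not all(conditions):    # at least one lane needs the 'else' side
        cost += else_cost
    return cost


# All 32 lanes agree: only one side of the branch is executed.
uniform = [True] * 32
# Lanes alternate: both sides are executed back to back.
mixed = [i % 2 == 0 for i in range(32)]

print(warp_cost(uniform, 10, 10))  # 10
print(warp_cost(mixed, 10, 10))    # 20
```

Under these assumptions, the same branch costs twice as much when the lanes disagree, which is roughly why simple, wide hardware pays off only when control flow is uniform across neighbouring threads.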