If you feel like you've finally grokked GPU/massively parallel software programming and need more challenges, I highly recommend playing around with digital circuits! The level of parallelism available to you in hardware is truly unmatched and it's incredibly fun, especially once you start really pushing implementations of your designs on FPGAs. Granted, FPGAs are frequently less useful than what you could do on a GPU due to the higher clock speeds available on ASICs (if your GPU core clock is 3GHz and your FPGA design maxes out at 500MHz [which would be admirable!], the GPU gets 6x as many cycles per second to match or beat your implementation!).
I know it depends on the analysis, but I often am doing somewhat embarrassingly parallel things. So just knowing GNU parallel for mid-scale things (plus basic parallelism in R/Python, although shared memory is a bear), and how to temporarily scale across the cloud to something like 500 cores, is huge.