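I just discovered that MSVC++ can automatically parallelize a loop, splitting its iterations across multiple threads. Add "#pragma loop(hint_parallel(8))" and "#pragma loop(ivdep)" immediately before the loop, and compile with the /Qpar option (with optimization turned on, e.g. /O2). The first pragma hints that the loop should be spread over up to 8 threads; the second tells the compiler to ignore dependencies between iterations that it cannot rule out itself, so only use it when the iterations really are independent. This simple change sped up my cryoablation simulation code by 4x.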
A lot of the radiology software that I write involves iterating over a 3D grid. The auto-parallelizer makes it easy to process the 2D slices on different processor cores at the same time. My laptop has 2 cores x 2 hyperthreads per core = 4 virtual cores, which accounts for the 4x speedup.
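Here is a minimal sketch of what this looks like for a 3D grid. It is not my simulation code: the function name scale_grid, the flat array layout, and the per-voxel operation are made up for illustration. The outer loop runs over 2D slices, so that is the loop the pragmas are attached to. Other compilers ignore these pragmas (possibly with a warning) and just run the loop serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: multiply every voxel of a 3D grid by a constant.
// The grid is stored as one flat array, slice after slice (z is the slowest index).
// Build with MSVC, for example:  cl /O2 /Qpar /Qpar-report:1 example.cpp
// (/Qpar-report:1 asks the compiler to list the loops it parallelized.)
void scale_grid(std::vector<float>& grid, int nx, int ny, int nz, float factor)
{
    assert(grid.size() == static_cast<std::size_t>(nx) * ny * nz);

    float* data = grid.data();
    const int slice_size = nx * ny;  // number of voxels in one 2D slice

    // Each z iteration touches only its own slice, so iterations are independent.
    // hint_parallel(8): suggest spreading the iterations over up to 8 threads.
    // ivdep: tell the compiler to ignore dependencies it cannot disprove.
#pragma loop(hint_parallel(8))
#pragma loop(ivdep)
    for (int z = 0; z < nz; ++z) {
        float* slice = data + static_cast<std::size_t>(z) * slice_size;
        for (int i = 0; i < slice_size; ++i) {
            slice[i] *= factor;
        }
    }
}
```

The pragmas are only hints, so the compiler may still decide not to parallelize a given loop; the /Qpar-report output is the way to check what it actually did.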