if(commandQueue[itr].first().def == typeHidden)
...
else if(commandQueue[itr].first().def == typeMemGateIn)
<p>I don't know if it's still the case, but in the past CUDA/OpenCL kernels would do the execution work for every path in the control-flow graph (CFG) and write only the results of the path actually taken to global memory.
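
To make that concrete, here is a minimal CPU-side sketch in plain C++ of the "do the work for both paths, keep one result" pattern. It only illustrates the idea and is not a claim about what any particular GPU compiler or driver emits. The enum reuses the two type names from the snippet above; `hiddenPath`, `memGateInPath`, and `evalBothThenSelect` are hypothetical helpers made up for the example.

```cpp
// Hypothetical command tags, reusing the names from the snippet above.
enum CommandType { typeHidden = 0, typeMemGateIn = 1 };

// Hypothetical per-path work; stands in for whatever each branch computes.
static float hiddenPath(float x) { return x * 2.0f; }
static float memGateInPath(float x) { return x + 1.0f; }

// Evaluate both paths unconditionally, then keep only the result that
// belongs to the path actually selected by `def`.
static float evalBothThenSelect(CommandType def, float x) {
    const float a = hiddenPath(x);       // work for the first path, always done
    const float b = memGateInPath(x);    // work for the second path, always done
    return (def == typeHidden) ? a : b;  // only one result is kept
}
```

Calling `evalBothThenSelect(typeHidden, 3.0f)` returns the first path's result and `evalBothThenSelect(typeMemGateIn, 3.0f)` returns the second's, but in both calls the work for both paths is performed. That is the cost I was describing: an if/else chain like the one above saves nothing if every branch body gets executed anyway.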