If I understand correctly:

CPUs do minimize latency by:

- Register renaming
- Out-of-order execution
- Branch prediction
- Speculative execution

They should not be oversubscribed, because a context switch means storing and loading registers, and the cache coherence protocols scale badly with more threads.

GPUs, on the other hand, maximize throughput by:

- A lot more memory bandwidth
- Smaller and slower cores, but many more of them
- Ultra-threading (the massively oversubscribed hyper-threading the video mentions)
- Context switching between wavefronts (basically the equivalent of a CPU thread), which just shifts the offset into the huge register file (no store and load)

The one area where CPUs are getting closer to GPUs is SIMD / SIMT. CPUs used to only be able to apply one instruction to a whole vector of elements, without masking (SIMD). With ARM SVE and x86 AVX-512 they can now (like GPUs) mask out individual lanes (SIMT-style) for both ALU operations and memory operations (gather loads / scatter stores).
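
To make the per-lane masking point concrete, here's a minimal sketch using standard AVX-512F intrinsics (assuming a machine and compiler that support them, e.g. gcc/clang with -mavx512f). The data, threshold, and increment are just illustrative values; the point is that the __mmask16 predicate lets individual lanes sit out of an operation, which classic fixed-width SIMD couldn't do.

```c
// Sketch: GPU-style per-lane predication on a CPU via AVX-512 masking.
// Adds 1.0f only to the elements of `data` greater than 0; the other
// lanes pass through unchanged instead of being computed and discarded.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float data[16];
    for (int i = 0; i < 16; ++i) data[i] = (float)(i - 8);  // -8 .. 7

    __m512 v         = _mm512_loadu_ps(data);
    __m512 threshold = _mm512_set1_ps(0.0f);

    // One predicate bit per lane: set where data[i] > 0.
    __mmask16 k = _mm512_cmp_ps_mask(v, threshold, _CMP_GT_OQ);

    // Masked add: active lanes get v + 1.0f, inactive lanes keep v.
    __m512 result = _mm512_mask_add_ps(v, k, v, _mm512_set1_ps(1.0f));

    _mm512_storeu_ps(data, result);
    for (int i = 0; i < 16; ++i) printf("%.1f ", data[i]);
    printf("\n");
    return 0;
}
```

The same __mmask16 type also feeds the masked gather/scatter intrinsics (e.g. _mm512_mask_i32gather_ps), which is what lets masked-off lanes skip memory accesses entirely rather than just having their results thrown away.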