When you compile with /arch:AVX or /arch:AVX2 in Visual C++, the compiler uses VEX encoding for everything, including SSE instructions that could be encoded the old way. This eliminates the described problem completely. It also eliminates a few other issues: for example, VEX-encoded unaligned memory access is guaranteed to work, while the legacy SSE encodings of the same instructions can fail at runtime with an alignment fault.<p>AVX1 and VEX are supported by the vast majority of CPUs made after 2011. The Steam hardware survey says that if you require AVX1, you'll lose less than 7% of the audience. Maybe it's time to move on already and drop support for pre-AVX1 CPUs?<p>In one of my current projects (started as a green field about 1.5 years ago), the customer was willing to sacrifice compatibility for development cost. I told them I'd ship sooner if we required AVX2 and D3D feature level 11.0 hardware, and they agreed. Over these 1.5 years they only hit that limit once, trying to run my software on some 12-core Ivy Bridge Xeon; it doesn't have AVX2, but it does support AVX1 and therefore VEX.
If I'm understanding the linked bug report right, this is just plain unpleasant. Running SSE code mixed with AVX code has a penalty unless you zero out the upper halves of the YMM registers (via vzeroupper) when switching from AVX to SSE. This seems annoying but no big deal: just save the registers your function needs to preserve using AVX instructions, zero them out, do your SSE computations, and restore them before returning. Not so fast! If the calling code was legacy SSE rather than AVX, the restore marks the upper halves of the registers as dirty and incurs a performance penalty after the function returns, even though the offending registers hold the same architectural contents as they did before the call, and the only AVX code anywhere is the code intended to clear and restore the upper halves precisely to prevent this. Apparently the fix is to test whether the upper halves are zero and take different save/restore code paths depending on the answer. What a mess.
Is there any hope of actually getting the promised performance out of modern hardware without having a CPU/assembly expert on your team?<p>The situation is ridiculous. I've played games that are simple reimplementations of DOS games (not via an emulator, but using SDL) that use 20% of a modern 2 GHz CPU, while the original ran on a 12 MHz machine.