This looks really cool. I'm used to seeing lots of new instructions & speed boosts, but Intel adding conditional load/store is a smart, interesting pairing that could significantly improve execution unit efficiency.<p>> <i>As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates performance of such workloads. Branch predictor improvements can mitigate this to a limited extent only as data-dependent branches are fundamentally hard to predict.</i><p>> <i>To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).</i><p>> <i>Intel® APX adds conditional forms of load, store, and compare/test instructions, and it also adds an option for the compiler to suppress the status flags writes of common instructions.</i><p><a href="https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html" rel="nofollow noreferrer">https://www.intel.com/content/www/us/en/developer/articles/t...</a><p>I didn't understand everything about the new "caller-saved volatile" general purpose registers & legacy compatibility. But there are some potentially really interesting optimizations here, with push/pop becoming dual-register capable and the values able to stay inside the core rather than going further out to "memory" (caches?):<p>> <i>Generally, more register state will need to be managed at function boundaries. In order to reduce the associated overhead, we are adding PUSH2/POP2 instructions that transfer two register values within a single memory operation. 
The processor tracks these new instructions internally and fast-forwards register data between matching PUSH2 and POP2 instructions without going through memory.</i><p>Neat stuff. It very superficially reminds me of Stream Semantic Registers on the very novel standalone-ish FPU in PULP's RISC-V based Occamy many-core chip, in that the unit acts in a more standalone fashion. <a href="https://www.youtube.com/watch?v=kMhdq7A3d3I#t=10m">https://www.youtube.com/watch?v=kMhdq7A3d3I#t=10m</a> <a href="https://pulp-platform.org/docs/BeniniSC11-22.pdf" rel="nofollow noreferrer">https://pulp-platform.org/docs/BeniniSC11-22.pdf</a>
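<p>For anyone unfamiliar with if-conversion, here is a rough C sketch of why plain CMOV is "too limited" for it when a load is involved. The function names are made up for illustration, and what a compiler actually emits depends on the compiler and flags, so treat this as a sketch of the idea rather than a description of APX codegen. In the branchy version the load must not execute when the pointer is NULL, so it can't simply be turned into an unconditional load plus a select; the usual workaround is to select a safe address first and then load unconditionally. Conditional loads of the kind the article describes would presumably make that workaround unnecessary.

```c
#include <stddef.h>

/* Branchy form: the load only happens when p is non-NULL.
 * If the branch is data-dependent, it is hard to predict. */
int load_or_default_branchy(const int *p, int dflt) {
    if (p != NULL) {
        return *p;
    }
    return dflt;
}

/* Hand if-converted form: first select an address that is always
 * safe to read, then do one unconditional load. The address select
 * is the kind of operation a CMOV-style instruction can express. */
int load_or_default_select(const int *p, int dflt) {
    const int *src = (p != NULL) ? p : &dflt;
    return *src;
}

/* A data-dependent condition inside a loop. Written as a select
 * (ternary) instead of an if, each iteration does the same work
 * regardless of the data, which makes it a candidate for
 * conditional instructions rather than a branch. */
long sum_above(const int *a, size_t n, int threshold) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (a[i] > threshold) ? a[i] : 0;
    }
    return total;
}
```

Both load variants return the same values for every input; the difference is only in whether the generated code needs a branch to stay safe.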