There have been rumors that Zen could do memory renaming [1]; this pretty much confirms it.<p>[1] Basically the same as register renaming, but instead of using the register file to rename architectural registers, it renames memory locations instead.
Trying to understand this.<p>Using latencies from the Zen 1 instruction table (see <a href="https://www.agner.org/optimize/instruction_tables.pdf" rel="nofollow">https://www.agner.org/optimize/instruction_tables.pdf</a>):<p><pre><code> mov dword [rsi], eax ; MOV m,r latency is 4
add dword [rsi], 5 ; ADD m,i latency is 6
mov ebx, dword [rsi] ; MOV r,m latency is 4</code></pre>
Total = 14<p>Each instruction depends on the result of the previous, so we need to sum all the latency figures to get the total cycle count. Is this right? How does Agner make it add up to 15?<p>Then for Zen 2:<p><pre><code> mov dword [rsi], eax ; MOV m,r latency is 0 (rather than 4,
; because it is mirrored)
add dword [rsi], 5 ; ADD m,i cannot find an entry for this.
; Looks like there's a typo in the doc.
; I guess the latency is 1.
mov ebx, dword [rsi] ; MOV r,m latency is 0</code></pre>
Total = 1<p>Again, how does Agner make it add up to 2?<p>And for Intel Skylake:<p><pre><code> mov dword [rsi], eax ; MOV m,r latency is 2
add dword [rsi], 5 ; ADD m,i - latency is 5
mov ebx, dword [rsi] ; MOV r,m latency is 2</code></pre>
Total = 9
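If you want to sanity-check those numbers yourself, a dependency chain carried through the same memory location is easy to time. Below is my own rough sketch, not Agner's test code (his test programs at agner.org do this far more carefully), and the __rdtsc()-based timing and iteration count are just illustrative assumptions. Compile with gcc -O2; TSC ticks are not core clock cycles, so treat the output as a relative number between microarchitectures rather than an absolute latency.<p><pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;x86intrin.h&gt;          /* __rdtsc() */

int main(void) {
    int slot = 0;               /* the memory operand shared by the chain */
    int *p = &slot;
    int acc = 0;                /* carries the dependency across iterations */
    const long iters = 100000000;

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < iters; i++) {
        /* Same store / read-modify-write / reload chain as above; feeding
           the loaded value back into the next store keeps the chain
           dependent across iterations, so ticks per iteration approximate
           the latency of the whole chain. */
        __asm__ volatile(
            "movl %[acc], (%[p])\n\t"   /* MOV m,r : store the carried value   */
            "addl $5, (%[p])\n\t"       /* ADD m,i : RMW of the same location  */
            "movl (%[p]), %[acc]\n\t"   /* MOV r,m : reload, extending the chain */
            : [acc] "+r" (acc)
            : [p] "r" (p)
            : "cc", "memory");
    }
    uint64_t t1 = __rdtsc();

    printf("acc=%d, ~%.1f TSC ticks per store/add/load chain\n",
           acc, (double)(t1 - t0) / (double)iters);
    return 0;
}</code></pre>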
The author wrote in another thread:<p>"If anybody has access to the new Chinese Zhaoxin processor, I would very much like to test it."<p>It will be very interesting to see how many actual changes Zhaoxin made to the VIA cores. I'd expect them to be minimal.
Another great thread from the same author: <a href="https://www.agner.org/forum/viewtopic.php?f=1&t=6" rel="nofollow">https://www.agner.org/forum/viewtopic.php?f=1&t=6</a>
Interesting. L1 caches are fast, and even if compilers do register allocation they kinda rely on spilling being not-too-shitty (and many compilers for higher-level languages don't invest too much time in reg-alloc anyway, since they might need to de-opt soon).<p>I'm curious whether this change is an effect of more transistors (more space for a bigger register file), or whether they're using the microcode translation to take advantage of the fact that most code doesn't use the SIMD vector registers, re-using the unused parts of the register file for these memory aliases.
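To make the spilling angle concrete, here's a hedged illustration of my own (the function name and shape are made up, and whether it actually spills depends on compiler and flags, so check with -S): once there are more simultaneously live values than the roughly fifteen usable x86-64 GPRs, the compiler has to park some of them on the stack, and the resulting "mov [rsp+N], reg ... mov reg, [rsp+N]" pairs are exactly the store/reload traffic that memory renaming can short-circuit.<p><pre><code>/* Hypothetical example: keep lots of values simultaneously live so a
 * typical x86-64 compiler spills some of them to the stack frame.      */
long many_live_values(const long *a)
{
    long v0  = a[0],  v1  = a[1],  v2  = a[2],  v3  = a[3];
    long v4  = a[4],  v5  = a[5],  v6  = a[6],  v7  = a[7];
    long v8  = a[8],  v9  = a[9],  v10 = a[10], v11 = a[11];
    long v12 = a[12], v13 = a[13], v14 = a[14], v15 = a[15];
    long v16 = a[16], v17 = a[17], v18 = a[18], v19 = a[19];

    /* Two mixes that each use every value keep all twenty live at once. */
    long x = (v0 ^ v1) + (v2 ^ v3) + (v4 ^ v5) + (v6 ^ v7) + (v8 ^ v9)
           + (v10 ^ v11) + (v12 ^ v13) + (v14 ^ v15) + (v16 ^ v17) + (v18 ^ v19);
    long y = (v0 + v19) ^ (v1 + v18) ^ (v2 + v17) ^ (v3 + v16) ^ (v4 + v15)
           ^ (v5 + v14) ^ (v6 + v13) ^ (v7 + v12) ^ (v8 + v11) ^ (v9 + v10);
    return x ^ y;
}</code></pre><p>The point isn't this particular function, just that compiler-generated spill code tends to reload a stack slot very soon after storing it, which is the pattern Zen 2 appears to optimize.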
As a Zen2 owner I'm very disappointed in VPGATHERDD throughput; that's so 2013.
On the other hand, I like the loop and call instruction performance a lot.
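On the gather point above: VPGATHERDD is what the _mm256_i32gather_epi32 intrinsic compiles to, so it's easy to compare against a plain loop of scalar loads on your own box. A minimal sketch of mine (the table contents and indices are arbitrary; compile with -mavx2):<p><pre><code>#include &lt;immintrin.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    int table[1024];
    for (int i = 0; i < 1024; i++) table[i] = i * 3;

    /* Eight 32-bit indices gathered by one VPGATHERDD; scale is 4 bytes. */
    __m256i idx = _mm256_setr_epi32(0, 17, 34, 51, 68, 85, 102, 119);
    __m256i v   = _mm256_i32gather_epi32(table, idx, 4);

    int out[8];
    _mm256_storeu_si256((__m256i *)out, v);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}</code></pre><p>Wrap the gather in a timing loop next to the equivalent eight scalar loads and you can reproduce the throughput figures from the instruction tables yourself.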
Surprising... and a little scary. This is not something I would've expected to be done in the current world of multiple cores. I wonder if things like volatile and lock-free algorithms would behave any differently or even break.
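As I understand it (my reading of the article, not anything AMD documents), the renaming is purely microarchitectural: the store still becomes architecturally visible through the normal cache-coherence machinery, and only the same core's own reload gets short-circuited, much like existing store-to-load forwarding. So correctly synchronized lock-free code shouldn't notice. A small sketch of the pattern in question, using C11 atomics with pthreads (the names and values are made up):<p><pre><code>#include &lt;pthread.h&gt;
#include &lt;stdatomic.h&gt;
#include &lt;stdio.h&gt;

static atomic_int data  = 0;
static atomic_int ready = 0;

static void *producer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    /* Release store: architecturally still a real store that reaches the
       coherent cache; any register mirroring on the producing core is
       invisible to other cores. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin until the producer's store becomes visible */
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}</code></pre><p>Code that relies on volatile alone for cross-thread ordering was already on shaky ground before this change, so I'd expect nothing new to break there.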
I sense this question is pretty elementary, but maybe someone can point me in the right direction for reading:<p>"When the CPU recognizes that the address [rsi] is the same in all three instructions..."<p>Is there another abstraction layer, like some CPU code that runs to do the "recognition", or does this "recognition" happen as a result of logic gates connected in a certain static way?<p>To put it more broadly: I'm really interested in understanding where the rubber meets the road. What "code" or "language" is being run directly on the hardware logic encoded as connections of transistors?
Ok, so most of my asm coding and knowledge of exactly what the CPU was doing ended sometime in the Z-80/68K/8086 timeframe. Are there any good books/resources on all the modern trickery that CPUs now utilize?