TechEcho

11 comments

userbinator over 3 years ago

If you do more than microbenchmarking, then the cache effects start showing up and often the smaller-yet-individually-slower sequence begins to win.

But I disagree that the 3 sequences are actually identical in semantics, because the ones containing adds and xors will also affect the flags, while xlat and movs with the arithmetic done in the addressing mode don't.

The other thing to note is that pushes and pops are essentially free despite containing both a memory access and arithmetic. I believe they added a special "stack engine" to make this fast starting with the P6.

I remember benchmarking AAD/AAM and they were basically exactly the same as the longer equivalent sequences, although that was on a 2nd generation i7. The (relative) timings do change a little between CPUs, but it seems that Intel mostly tries to optimise them every time so they're not all that much slower. It would be interesting to see this benchmark done on some other CPU models (e.g. AMDs, which tend to have very different relative timings, or something like an Atom or even NetBurst).
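
The point about flags can be illustrated with a toy model. This is only a sketch in Python, not the article's benchmark code: the `Cpu` class, the function names, and the exact xor/mov/add/mov sequence are assumptions made for illustration, and only the zero flag is modelled.

```python
from dataclasses import dataclass, field

MASK64 = 0xFFFFFFFFFFFFFFFF


@dataclass
class Cpu:
    """Toy state: three registers, the zero flag, and a sparse memory."""
    al: int = 0
    rbx: int = 0
    rcx: int = 0
    zf: bool = False
    mem: dict = field(default_factory=dict)


def xlatb(cpu: Cpu) -> None:
    # AL <- [RBX + zero-extended AL]; XLATB leaves the flags alone.
    cpu.al = cpu.mem[(cpu.rbx + cpu.al) & MASK64]


def xor_add_lookup(cpu: Cpu) -> None:
    # Hypothetical longer sequence (an assumption, not copied from the meme).
    # xor rcx, rcx : the result is 0, so ZF is set
    cpu.rcx = 0
    cpu.zf = True
    # mov cl, al : MOV does not modify flags
    cpu.rcx = (cpu.rcx & ~0xFF) | cpu.al
    # add rcx, rbx : ZF now reflects the sum
    cpu.rcx = (cpu.rcx + cpu.rbx) & MASK64
    cpu.zf = cpu.rcx == 0
    # mov al, [rcx]
    cpu.al = cpu.mem[cpu.rcx]


def demo():
    # 256-byte lookup table placed at address 0x1000
    table = {0x1000 + i: (i * 7) & 0xFF for i in range(256)}
    a = Cpu(al=5, rbx=0x1000, zf=True, mem=table)
    b = Cpu(al=5, rbx=0x1000, zf=True, mem=table)
    xlatb(a)
    xor_add_lookup(b)
    return (a.al, a.zf), (b.al, b.zf)
```

Both paths load the same byte into AL, but only the xor/add path overwrites the zero flag that was set beforehand, so code that tests the flags afterwards would behave differently.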

jeroenhd over 3 years ago

For an instruction only left in for backwards compatibility, I think the microcode is quite nicely optimized. Sure, it could be faster, but it beat two more naive implementations despite originating from the 386 days.

I do wonder, though, if there could still be some gems hidden deep in the legacy instructions that compilers could make use of for some very peculiar algorithms.

b5n over 3 years ago

> The meme is wrong

The third panel is generally meant to be the correct technical answer, while the last panel is reserved for the punchline.

Understanding the 'galaxy brain' format might have saved the author the trouble (or at least guided proper expectations), although it was a cool exercise.

oshiar53-0 over 3 years ago

> The meme is wrong

Nah, it's rather that the meme is correctly absurd, as intended.

oshiar53-0 over 3 years ago

Wouldn't "movzx ecx, al" save one byte of rex.W prefix? Just wondering.
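
For reference, the byte counts behind this question can be written out. This is a sketch in Python with hand-written encodings for 64-bit mode; the `ENCODINGS` table and `size_report` helper are names made up for the illustration. Since a write to a 32-bit register zero-extends into the full 64-bit register, both `movzx` forms leave the same value in RCX.

```python
# Machine-code bytes for 64-bit mode, written out by hand.
ENCODINGS = {
    "movzx rcx, al": bytes([0x48, 0x0F, 0xB6, 0xC8]),  # REX.W + 0F B6 /r
    "movzx ecx, al": bytes([0x0F, 0xB6, 0xC8]),        # same opcode, no REX.W
    "xlatb":         bytes([0xD7]),                    # single-byte opcode
}


def size_report(encodings=ENCODINGS):
    """Return the encoded length in bytes of each instruction."""
    return {asm: len(code) for asm, code in encodings.items()}
```

In this table the 32-bit form is one byte shorter than the 64-bit form, and `xlatb` on its own is a single byte.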

kccqzy over 3 years ago

Why didn't the author benchmark the one-instruction equivalent MOV AL,[RBX+AL] that the author uses to explain XLATB? How would its performance differ from the third sequence going through RCX?

DeathArrow over 3 years ago

> However, since that time, all modern CPUs have turned RISC-like, by internally using a reduced instruction set and translating the ISA opcodes into internal commands, some implemented using CPU microcode.

Is there a way Intel could expose the microcode and internal commands to the outside, so compilers can target them directly instead of the x86 instruction set? If yes, would there be anything to gain or lose?

jstanley over 3 years ago

> what are the chances this obscure opcode is faster than optimized loads?

Sometimes it's not about being faster, sometimes it's about taking up less space. The graphic doesn't say what it's aiming for, and based on what I see in the graphic, the 4th panel seems to take up the least space.

kayson over 3 years ago

Would someone mind explaining what all the assembly instructions in the meme do? In particular, I'm wondering why you would do xor rcx, rcx when that result is always 0.
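
On the `xor rcx, rcx` part: the result always being 0 is the purpose, since XOR-ing a register with itself is a common idiom for clearing it, and it encodes more compactly than a `mov` with an immediate zero. A small Python sketch of both facts follows; the encodings are written out by hand for 64-bit mode, an assembler may pick a different encoding for the `mov` forms, and the names `xor_self`, `ZEROING`, and `shortest` are made up for the illustration.

```python
def xor_self(value: int) -> int:
    # x ^ x is 0 for every x, so the old register contents do not matter
    return value ^ value


# One straightforward encoding of each way to zero RCX (64-bit mode).
# Writing ECX zero-extends into RCX, so the 32-bit forms clear all 64 bits.
ZEROING = {
    "xor ecx, ecx": bytes([0x31, 0xC9]),
    "xor rcx, rcx": bytes([0x48, 0x31, 0xC9]),
    "mov ecx, 0":   bytes([0xB9, 0x00, 0x00, 0x00, 0x00]),
    "mov rcx, 0":   bytes([0x48, 0xC7, 0xC1, 0x00, 0x00, 0x00, 0x00]),
}


def shortest(encodings=ZEROING) -> str:
    """Return the mnemonic with the fewest encoded bytes."""
    return min(encodings, key=lambda asm: len(encodings[asm]))
```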

celrod over 3 years ago

Hmm, uiCA results:

xlatb: https://bit.ly/3cyBNN5
sequence: https://bit.ly/3nCmVTX

xlatb is looking better here. There are also some front-end concerns that may favor xlatb, in particular if it's friendlier to the decoder. xlat is also fewer µops, taking up less of the µop cache once decoded.

oblib over 3 years ago

> nerd sniped

I honestly don't know anything about this stuff, but the title is awesome.

I got nerd sniped into benchmarking legacy x86 instructions (2019)
