科技回声

6 comments

mgraczyk, almost 11 years ago

To elaborate on the justification for the answer: "So Intel probably shoved popcnt into the same category to keep the processor design simple."

In the processor design I work on, we do register dependency checks by partitioning all instructions into a set of "timing classes" and checking the dispatch delay needed between dependent register producers and consumers across all possible timing class pairs. The delays vary depending on available forwarding networks, resource conflicts, etc. We often group instructions into suboptimal timing classes to simplify other parts of the design or just to make the dispatch logic simpler.

Intel's x86 core is waaaaay more complicated than the core I work on and has far more instructions, so it's probably safe to say that they make these suboptimal classifications often. I strongly suspect that the false dependency was intentional and not a "hardware bug" as some of the StackOverflow comments seem to suggest.
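A rough illustration of the timing-class idea, written as a lookup table in C. The class names and delay values are made up for the example and are not taken from any real core; the point is only that the dispatch check is keyed on the class pair, not on the individual instruction.

```c
#include <stdint.h>

/* Hypothetical timing classes; a real design would have many more. */
enum timing_class { TC_ALU, TC_MUL, TC_LOAD, TC_COUNT };

/* Made-up minimum number of cycles between dispatching a producer of
   class [row] and a dependent consumer of class [column]. */
static const uint8_t dispatch_delay[TC_COUNT][TC_COUNT] = {
    /* consumer:   ALU MUL LOAD */
    /* ALU  */   { 1,  1,  1 },
    /* MUL  */   { 3,  3,  3 },
    /* LOAD */   { 4,  4,  5 },
};

/* Earliest cycle at which a dependent consumer may dispatch, given the
   cycle at which its producer was dispatched. */
static uint64_t earliest_dispatch(uint64_t producer_cycle,
                                  enum timing_class producer,
                                  enum timing_class consumer)
{
    return producer_cycle + dispatch_delay[producer][consumer];
}
```

With a table like this, an instruction filed under an existing class inherits every rule of that class, including ones it does not strictly need. That is the kind of deliberate, suboptimal classification described above.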

tofof, almost 11 years ago

TLDR: Headline (and indeed bulk of article) is a phantom symptom. True cause is register allocator behavior.

Specifically, the allocator's handling of an instruction with a false dependency on the register it writes to, coupled with multiple compilers being unaware of the false dependency.
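A commonly cited way to sidestep this kind of false dependency is to zero the destination register right before the instruction, so the CPU has no older value to wait on. The sketch below assumes GCC or Clang on x86-64 and a CPU that supports POPCNT; the function name is hypothetical, and the fallback branch is only there so the code builds elsewhere.

```c
#include <stdint.h>

/* Hypothetical helper: popcnt whose destination register is zeroed first. */
static inline uint64_t popcnt_no_false_dep(uint64_t x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t count;
    /* "=&r" (early clobber) keeps count in a different register from x,
       so the xor cannot wipe the input before popcnt reads it.
       %k0 names the 32-bit form of the register; writing it also
       zeroes the upper 32 bits. */
    __asm__("xor %k0, %k0\n\t"
            "popcnt %1, %0"
            : "=&r"(count)
            : "r"(x)
            : "cc");
    return count;
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    uint64_t count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
#endif
}
```

Whether this actually helps depends on the microarchitecture and on what the compiler does with the surrounding loop, so it needs measuring rather than assuming.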

jbondeson, almost 11 years ago

This is why micro-benchmarking is Russian roulette.

When you distill a loop until you're finding the exact bottleneck in the system (pipelining, branch prediction, etc.) you need to be very, very careful you're measuring what you think you are. Otherwise you'll end up in this situation where you're benchmarking a compiler...

byuu, almost 11 years ago

I suppose similarly related to this, when I was keeping track of synchronization between two cooperative simulation threads running at different frequencies, I had a 64-bit signed integer: chip A would add chip_B_frequency * chip_A_cycles_executed; and chip B would subtract chip_A_frequency * chip_B_cycles_executed. If the value was >=0, chip A was ahead and would switch to B; and if the value was <0, chip B was ahead and would switch to A.

I ended up getting a noticeable speed boost just by using sync += (uint32_t)clocks * (uint64_t)frequency; ... just a simple 32-bit x 64-bit multiply was quite a bit faster than a 64-bit x 64-bit multiply. (One had to be 64-bit to prevent the multiplication from overflowing, as one value was in the MHz range and the other could be up to ~2000 or so.)

I've observed this on both AMD and Intel amd64 CPUs. Not sure how that'd hold up on other CPUs. As always though, profile your code first, and only consider these types of tricks in hot code areas.
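For anyone who wants to see the scheme spelled out, here is a minimal sketch in C. The struct and function names are hypothetical and not from the original code; only the shape of the multiply, a 32-bit cycle count times a 64-bit frequency, follows the description above.

```c
#include <stdint.h>

/* Hypothetical state shared by the two cooperative threads. */
struct sync_state {
    int64_t  sync;    /* >= 0: chip A is ahead; < 0: chip B is ahead */
    uint64_t freq_a;  /* chip A clock frequency in Hz */
    uint64_t freq_b;  /* chip B clock frequency in Hz */
};

/* Chip A ran `clocks` cycles: add chip B's frequency times that count. */
static void chip_a_ran(struct sync_state *s, uint32_t clocks)
{
    s->sync += (int64_t)((uint32_t)clocks * (uint64_t)s->freq_b);
}

/* Chip B ran `clocks` cycles: subtract chip A's frequency times that count. */
static void chip_b_ran(struct sync_state *s, uint32_t clocks)
{
    s->sync -= (int64_t)((uint32_t)clocks * (uint64_t)s->freq_a);
}

/* Nonzero when chip A is ahead and control should switch to chip B. */
static int chip_a_is_ahead(const struct sync_state *s)
{
    return s->sync >= 0;
}
```

Keeping the cycle count as a uint32_t tells the compiler the upper half of that operand is zero. Whether that produces a cheaper multiply depends on the compiler and CPU, so it is worth profiling before relying on it.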

userbinator, almost 11 years ago

It should be noted that using 64-bit operands, even in 64-bit mode, incurs an extra penalty of 1 byte per instruction, for the REX prefix. The same applies to using the extended registers (the uncreatively-named "r8" through "r15".) This is very much not noticeable for microbenchmarks, where all the code of a loop fits in the cache, but for bigger ones, the effects of icache misses can become quite significant. A smaller instruction sequence that is slower than a larger one when microbenchmarked can become much faster once that code is benchmarked as part of a whole system.
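To make the one-byte cost concrete, the sketch below lists the machine-code bytes for three forms of a register-to-register add, as I understand the x86-64 encoding. The helper name is hypothetical, and the prefix check only looks at the first byte, so it is valid in 64-bit mode and only for instructions with no other prefixes.

```c
#include <stddef.h>
#include <stdint.h>

/* add eax, ebx : 32-bit operands, legacy registers, no prefix needed */
static const uint8_t add_eax_ebx[] = { 0x01, 0xD8 };

/* add rax, rbx : REX.W (0x48) selects the 64-bit operand size */
static const uint8_t add_rax_rbx[] = { 0x48, 0x01, 0xD8 };

/* add r8d, eax : REX.B (0x41) is needed to reach extended register r8 */
static const uint8_t add_r8d_eax[] = { 0x41, 0x01, 0xC0 };

/* In 64-bit mode, bytes 0x40 through 0x4F are REX prefixes. */
static int starts_with_rex(const uint8_t *insn)
{
    return (insn[0] & 0xF0) == 0x40;
}
```

Both the 64-bit form and the r8d form come out one byte longer than the plain 32-bit form, which is the overhead that adds up across a large code footprint.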

Replacing 32-bit loop variable with 64-bit introduces performance deviations
