Somewhat related: when I was keeping two cooperative simulation threads running at different frequencies in sync, I used a single 64-bit signed integer. Chip A would add chip_B_frequency * chip_A_cycles_executed, and chip B would subtract chip_A_frequency * chip_B_cycles_executed. If the value was >= 0, chip A was ahead and would switch to B; if it was < 0, chip B was ahead and would switch to A.

I got a noticeable speed boost just by writing sync += (uint32_t)clocks * (uint64_t)frequency; a plain 32-bit x 64-bit multiply was quite a bit faster than a 64-bit x 64-bit multiply. (One operand had to be 64-bit to keep the product from overflowing, since the frequency was in the MHz range and the cycle count could be up to ~2000 or so.)

I've observed this on both AMD and Intel amd64 CPUs; I'm not sure how it would hold up on other CPUs. As always, though, profile your code first, and only consider tricks like this in hot code paths.
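For anyone who wants to see the scheme concretely, here is a minimal sketch of that shared counter in C. The names (`scheduler`, `chip_a_step`, `chip_b_step`, `run`) and the two clock rates are hypothetical, chosen only for illustration. The idea is that chip A's elapsed time is clocks / FREQ_A seconds, which in units of 1 / (FREQ_A * FREQ_B) seconds is clocks * FREQ_B, so both chips can be compared with integer math and no division. The fixed cycle "bursts" in `run` stand in for whatever a real chip core would execute before yielding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock rates in Hz; not taken from the original comment. */
#define FREQ_A 21477272u  /* ~21.48 MHz */
#define FREQ_B 3579545u   /* ~3.58 MHz */

typedef struct {
    int64_t sync;  /* >= 0: chip A is ahead; < 0: chip B is ahead */
} scheduler;

/* Chip A ran `clocks` cycles. The 32-bit count times the 64-bit frequency
   gives a 64-bit product; ~2000 cycles * MHz-range rates exceeds 2^32. */
static void chip_a_step(scheduler *s, uint32_t clocks) {
    s->sync += (int64_t)(clocks * (uint64_t)FREQ_B);
}

/* Chip B ran `clocks` cycles: move the counter the other way. */
static void chip_b_step(scheduler *s, uint32_t clocks) {
    s->sync -= (int64_t)(clocks * (uint64_t)FREQ_A);
}

static int chip_a_is_ahead(const scheduler *s) {
    return s->sync >= 0;
}

/* Cooperative loop: always run whichever chip is behind, for a fixed
   (hypothetical) burst of cycles, then re-check the counter. */
static void run(scheduler *s, int switches, uint32_t burst_a, uint32_t burst_b) {
    assert(burst_a > 0 && burst_b > 0); /* a zero burst could never catch up */
    for (int i = 0; i < switches; i++) {
        if (chip_a_is_ahead(s))
            chip_b_step(s, burst_b);  /* B is behind: let it run */
        else
            chip_a_step(s, burst_a);  /* A is behind: let it run */
    }
}
```

With these example rates, one simulated second on each chip (FREQ_A cycles of A and FREQ_B cycles of B) brings the counter back to exactly 0, and a single 2000-cycle step of A already adds 7,159,090,000, which would not fit in 32 bits. Note that under C's usual arithmetic conversions the 32-bit operand is still widened to 64 bits before the multiply, so any speed difference comes down to the code the compiler generates for it; that is worth checking alongside the profile.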