If the problem really is that the processor gets confused about instruction length, I'm impressed that it can be fixed in microcode without a huge performance hit: my intuition (which could be totally wrong) is that computing an instruction's length is something synthesized directly into logic gates.<p>Actually, come to think of it, my hunch is that the uOP decoder (presumably in hardware) is fine, and that the microcoded optimized copy routine is inferring things about the uOP stream that just aren't true: "Oh, this is a rep movs, so of course I need to go back two uOPs to loop," or something like that.<p>I don't expect Intel's CPU team to divulge the details, though. :-)