How can software run on different CPUs when they support different operations?<p>When you download "debian-live-12.7.0-amd64-kde.iso", all the programs in the repos support all current Intel and AMD CPUs, right? Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?<p>Do dynamic languages (JavaScript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?
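For what it's worth, one common answer is "both": binaries are built for a baseline instruction set, and hot routines pick a faster variant at startup if the CPU advertises the feature. Here is a minimal sketch of that dispatch pattern. Everything in it is illustrative: `read_cpu_flags`, `select_dot` and the "wide" kernel are hypothetical names, the wide kernel is a plain-Python stand-in rather than real AVX2 code, and the flag lookup assumes a Linux-style `/proc/cpuinfo` (it falls back to the baseline anywhere else).

```python
from pathlib import Path


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags as a set, or an empty set if unavailable.

    Assumption: an x86 Linux-style /proc/cpuinfo with a "flags : ..." line.
    """
    try:
        text = Path(path).read_text()
    except OSError:
        return set()
    for line in text.splitlines():
        if line.startswith("flags"):
            return set(line.partition(":")[2].split())
    return set()


def dot_baseline(a, b):
    """Baseline kernel: works on every CPU the build targets."""
    return sum(x * y for x, y in zip(a, b))


def dot_wide_standin(a, b):
    """Hypothetical stand-in for a hand-vectorised kernel (8 lanes at a time)."""
    total = 0
    for i in range(0, len(a), 8):
        total += sum(x * y for x, y in zip(a[i:i + 8], b[i:i + 8]))
    return total


# Candidates ordered from most to least demanding; the baseline is the fallback.
CANDIDATES = [("avx2", dot_wide_standin)]


def select_dot(flags):
    """Pick the best kernel whose required feature flag is present."""
    for flag, impl in CANDIDATES:
        if flag in flags:
            return impl
    return dot_baseline


# Resolve once at startup, then call through the chosen function.
dot = select_dot(read_cpu_flags())
```

The point of the design is that the feature check happens once, not on every call, and that every variant returns the same result, so the caller never needs to know which one was chosen.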
I suspect Intel uses 32x32b multipliers instead of his theorised 16x16b, but with only one for every second lane.
It lines up more closely with VPMULLQ, and it seems odd that PMULUDQ would be one uOp vs PMULLD's two.<p>PMULLD is probably just doing 2x PMULUDQ and discarding the high bits.<p>(I tried commenting on his blog but it's awaiting moderation - I don't know if that's ever checked, or just sits in the queue forever)
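To make the "2x PMULUDQ and discard the high bits" idea concrete, here is a small lane-level model in Python. It is only a sketch of the arithmetic, not a claim about what the hardware actually does: registers are modelled as lists of four 32-bit lanes, and the helper names are made up for illustration.

```python
MASK32 = 0xFFFFFFFF


def pmuludq(a, b):
    """Model of PMULUDQ on a 128-bit register: multiply the even 32-bit
    lanes (0 and 2) as unsigned values, giving two 64-bit products."""
    return [(a[0] & MASK32) * (b[0] & MASK32),
            (a[2] & MASK32) * (b[2] & MASK32)]


def pmulld_reference(a, b):
    """Model of PMULLD: low 32 bits of each of the four lane products."""
    return [(x * y) & MASK32 for x, y in zip(a, b)]


def pmulld_via_pmuludq(a, b):
    """Build PMULLD from two PMULUDQ passes, discarding the high halves."""
    even = pmuludq(a, b)
    # Move the odd lanes (1 and 3) into the even positions for the second pass.
    odd = pmuludq([a[1], 0, a[3], 0], [b[1], 0, b[3], 0])
    return [even[0] & MASK32, odd[0] & MASK32,
            even[1] & MASK32, odd[1] & MASK32]


a = [3, -7, 0x7FFFFFFF, 123456789]
b = [5, 11, 2, -987654321]
same = pmulld_via_pmuludq(a, b) == pmulld_reference(a, b)
```

This works for signed inputs too, because the low 32 bits of a product are the same whether the operands are interpreted as signed or unsigned.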
It's a shame that SIMD is still a dark art. I've looked at writing a few simple algorithms with it, but I have to do it in my own time because it would be difficult to justify to my employer. I do know that GCC is generally terrible at auto-vectorising code; Clang is much better but far from perfect. Using intrinsics directly just leads to code that's unmaintainable by others not versed in the dark art. Even wrappers over intrinsics don't help much here. I feel there's a lot of efficiency being left on the table because these instructions aren't being used more.
> PMADDUBSW produces a word result which, it turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 128*255] = [-32640,32640]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are either near the negative or positive end of the possible output range, the sum overflows.<p>Can someone explain this to me? Isn't 32640 < 32767? How's this an overflow?
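The key word in the quote is "sum": each single product fits in a signed word, but PMADDUBSW adds two adjacent products into one 16-bit lane, and that sum can be up to twice as large. A quick sketch of the arithmetic for one output lane follows; the function names are made up for illustration, and the clamping models the signed saturation the instruction applies. (A signed byte actually tops out at 127, so the largest positive product is 127*255 = 32385, slightly below the bound in the quote; the conclusion is the same.)

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(x):
    """Clamp to the signed 16-bit range, as PMADDUBSW does."""
    return max(INT16_MIN, min(INT16_MAX, x))


def pmaddubsw_lane(u0, s0, u1, s1):
    """One 16-bit output lane: u0*s0 + u1*s1.

    u0, u1 are unsigned bytes (0..255); s0, s1 are signed bytes (-128..127).
    Returns (exact_sum, saturated_result).
    """
    assert 0 <= u0 <= 255 and 0 <= u1 <= 255
    assert -128 <= s0 <= 127 and -128 <= s1 <= 127
    exact = u0 * s0 + u1 * s1
    return exact, saturate_int16(exact)


single = 255 * -128                                    # one product: -32640, fits
neg_exact, neg_sat = pmaddubsw_lane(255, -128, 255, -128)
pos_exact, pos_sat = pmaddubsw_lane(255, 127, 255, 127)
```

Here `neg_exact` is -65280 and `pos_exact` is 64770, both well outside [-32768, 32767], so the stored lane values are clamped to -32768 and 32767 and the true sum is lost.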
Maybe it’s just me in the morning, but for a text about CPU instructions, I found this a very hard read. It feels like it loads you with details for ages.