TechEcho

Why those particular integer multiplies?

75 points by luu 7 months ago

6 comments

TekMol 7 months ago
How can software run on different CPUs when they support different operations?

When you download "debian-live-12.7.0-amd64-kde.iso", all the programs in the repos support all current Intel and AMD CPUs, right? Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

Do dynamic languages (JavaScript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?
RaisingSpear 7 months ago
I suspect Intel uses 32x32b multipliers instead of his theorised 16x16b, except that it only has one for every second lane. That lines up more closely with VPMULLQ, and it seems odd that PMULUDQ would be one uOp vs PMULLD's two.

PMULLD is probably just doing 2x PMULUDQ and discarding the high bits.

(I tried commenting on his blog but it's awaiting moderation. I don't know if that's ever checked, or if it just sits in the queue forever.)
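The "discard the high bits" idea can be illustrated with a scalar model. This is only a sketch of the arithmetic in Python, not a claim about how Intel's multipliers are actually wired, and the helper names are hypothetical. It relies on the fact that the low 32 bits of a product are the same whether the inputs are read as signed or unsigned.

```python
MASK32 = 0xFFFFFFFF

def pmuludq_lane(a, b):
    """Scalar model of one PMULUDQ lane: unsigned 32x32 -> 64-bit product."""
    return (a & MASK32) * (b & MASK32)

def pmulld_lane(a, b):
    """Scalar model of one PMULLD lane: keep only the low 32 bits of the product."""
    return pmuludq_lane(a, b) & MASK32

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a two's-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

# -3 * 7 computed through the unsigned widening multiply, then truncated
low = pmulld_lane(-3 & MASK32, 7)
print(to_signed32(low))  # -21
```

Truncating the unsigned 64-bit product gives the correct signed 32-bit result, which is why a low multiply could in principle be built from widening multiplies.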
Const-me 7 months ago
Found a bug in the article.

Maximum for signed bytes is +127, not +128. Minimum is correct, it's -128.
secondcoming 7 months ago
It's a shame that SIMD is still a dark art. I've looked at writing a few simple algorithms with it, but I have to do it in my own time as it would be difficult to justify to my employer. I do know that GCC is generally terrible at auto-vectorising code; Clang is much better but far from perfect. Using intrinsics directly will just lead to code that's unmaintainable by others not versed in the dark art. Even wrappers over intrinsics don't help much here. I feel there's a lot of efficiency being left on the table because these instructions aren't being used more.
NooneAtAll3 7 months ago
> PMADDUBSW produces a word result which, it turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 128*255] = [-32640,32640]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are either near the negative or positive end of the possible output range, the sum overflows.

Can someone explain this to me? Isn't 32640 < 32767? How is this an overflow?
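The arithmetic in the quoted passage can be checked with a scalar model. The sketch below is plain Python, not the instruction itself; `pmaddubsw_lane` is a hypothetical helper that models one 16-bit output lane, assuming the documented behaviour (unsigned byte times signed byte, with adjacent products added using signed saturation). Each product fits in a signed word on its own; it is the sum of two of them that can leave the range.

```python
def pmaddubsw_lane(a0, b0, a1, b1):
    """Scalar model of one PMADDUBSW output lane.

    a0, a1 are unsigned bytes; b0, b1 are signed bytes.
    Returns (exact_sum, saturated_sum).
    """
    assert 0 <= a0 <= 255 and 0 <= a1 <= 255
    assert -128 <= b0 <= 127 and -128 <= b1 <= 127
    exact = a0 * b0 + a1 * b1                    # two products, then one add
    saturated = max(-32768, min(32767, exact))   # clamp to signed 16-bit range
    return exact, saturated

print(pmaddubsw_lane(255, 127, 0, 0))        # (32385, 32385): one product fits
print(pmaddubsw_lane(255, 127, 255, 127))    # (64770, 32767): the sum does not
print(pmaddubsw_lane(255, -128, 255, -128))  # (-65280, -32768)
```

With both products at the same extreme, the exact sum is roughly twice what a signed word can hold, so the result is clamped.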
wruza 7 months ago
Maybe it's just me in the morning, but for some reason this was a very hard read for a text about CPU instructions. It feels like it loads you with details for ages.