SIMD Made Easy with Intel Implicit SPMD Program Compiler

31 points by truth_seeker, over 5 years ago

4 comments

wmu, over 5 years ago
The hand-coded AVX2 procedure is far from optimal: it wastes time on a horizontal addition in each iteration.

The First Rule of SIMD-ization says: keep all the intermediate results in vector(s), and do the horizontal reduction at the end.

Conversion from the comparison result into a vector of integers can be done a bit more simply: just one bitwise AND is needed, followed by a cast to __m256i (casting doesn't emit any code, as SIMD registers are untyped).
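
To make the two suggestions above concrete, here is a minimal sketch, not the article's code. It assumes a hypothetical task (counting the floats in an array that exceed a threshold), and the names count_greater, count_greater_avx2 and count_greater_scalar are invented for illustration. It also assumes GCC or Clang on x86 (for the target attribute and __builtin_cpu_supports) and fewer than 2^31 matches per lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical kernel: count the elements of data[0..n) greater than threshold. */
__attribute__((target("avx2")))
static size_t count_greater_avx2(const float *data, size_t n, float threshold)
{
    const __m256 limit = _mm256_set1_ps(threshold);
    /* Bit pattern of integer 1 in every 32-bit lane, typed as float for the AND. */
    const __m256 one_bits = _mm256_castsi256_ps(_mm256_set1_epi32(1));
    __m256i acc = _mm256_setzero_si256();  /* eight running counts */
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(data + i);
        /* Each lane becomes all-ones where x > limit, zero elsewhere. */
        __m256 mask = _mm256_cmp_ps(x, limit, _CMP_GT_OQ);
        /* One AND turns the mask into 0/1; the cast emits no instruction. */
        __m256i ones = _mm256_castps_si256(_mm256_and_ps(mask, one_bits));
        /* Vertical add only: no horizontal work inside the loop. */
        acc = _mm256_add_epi32(acc, ones);
    }

    /* Single horizontal reduction, done once after the loop. */
    int32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    size_t total = 0;
    for (int k = 0; k < 8; ++k)
        total += (size_t)lanes[k];

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; ++i)
        total += data[i] > threshold;
    return total;
}
#endif

/* Plain scalar version, used when AVX2 is not available. */
static size_t count_greater_scalar(const float *data, size_t n, float threshold)
{
    size_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += data[i] > threshold;
    return total;
}

/* Picks the AVX2 path at run time when the CPU supports it. */
size_t count_greater(const float *data, size_t n, float threshold)
{
    assert(data != NULL || n == 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return count_greater_avx2(data, n, threshold);
#endif
    return count_greater_scalar(data, n, threshold);
}
```

The final reduction here is deliberately the simplest possible one (store the lanes and add them up); since it runs once per call rather than once per iteration, its cost matters little.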

truth_seeker, over 5 years ago
If anyone is looking for more insight into it, follow this link:

https://www.slideshare.net/IntelSoftware/simple-single-instruction-multiple-data-simd-with-the-intel-implicit-spmd-program-compiler-intel-ispc

Const-me, over 5 years ago
> if you want to target multiple ISAs, you need to write multiple algorithms

In my experience, these algorithms are similar to each other. More often than not, they don't require much extra time: a few macros here and there, a few templates, a couple of versions of a small low-level function, etc.

> _mm256_hadd_epi32

That instruction is slow, e.g. on Ryzen it has a latency of 7. _mm256_slli_si256 and bitwise ops have a latency of 1 and can often do the same job faster.

> readability is reduced when compared to the original scalar implementation

Solvable with a library, for example: https://github.com/Const-me/IntelIntrinsics
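
As an illustration of the point about avoiding _mm256_hadd_epi32, the following sketch sums the eight 32-bit lanes of a vector using byte shifts and ordinary adds. It is one possible arrangement, not code from the article or from the comment: it uses _mm256_srli_si256 (the right-shift sibling of the _mm256_slli_si256 mentioned above) so the result lands in lane 0, the names hsum8 and hsum8_shift_avx2 are hypothetical, and it assumes GCC or Clang on x86. No latency figures are claimed here; measure on the target CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Sum of eight int32 values using shifts and adds instead of hadd. */
__attribute__((target("avx2")))
static int32_t hsum8_shift_avx2(const int32_t *v)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)v);
    /* Byte shifts act on each 128-bit half independently. */
    x = _mm256_add_epi32(x, _mm256_srli_si256(x, 8)); /* lane0 += lane2, lane1 += lane3 */
    x = _mm256_add_epi32(x, _mm256_srli_si256(x, 4)); /* lane0 += lane1 */
    /* Lane 0 of each half now holds that half's sum; add the two halves. */
    __m128i lo = _mm256_castsi256_si128(x);
    __m128i hi = _mm256_extracti128_si256(x, 1);
    return _mm_cvtsi128_si32(_mm_add_epi32(lo, hi));
}
#endif

/* Uses the AVX2 path when the CPU supports it, a scalar loop otherwise. */
int32_t hsum8(const int32_t *v)
{
    assert(v != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return hsum8_shift_avx2(v);
#endif
    int32_t s = 0;
    for (int k = 0; k < 8; ++k)
        s += v[k];
    return s;
}
```

For example, hsum8 applied to the array {1, 2, 3, 4, 5, 6, 7, 8} returns 36.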

KenanSulayman, over 5 years ago
Very interesting, and I can't wait to try it out.

It's a pity though, based on what I understood from the website, that it only produces binaries which can be linked against, rather than actually generating C/C++ code. That would be great for LTO, but would also allow for better inspection of the generated SIMD code prior to compilation and ensure that all code is compiled by the same compiler. I guess the best way to inspect the artefacts prior to assembly is via LLVM IR.

I'm pretty happy that Intel chose to implement this based on LLVM. I'd have expected this to be sitting on top of icc.