
Understanding SIMD: Infinite complexity of trivial problems

257 points by verdagon · 6 months ago · 13 comments

dragontamer · 6 months ago

Intel needs to look at what has happened to their AVX instructions and why NVidia has taken over.

If you had just written your SIMD in CUDA 15 years ago, NVidia's compilers would have given you maximum performance across all NVidia GPUs, rather than forcing you to write and rewrite for SSE vs AVX vs AVX512.

GPU SIMD is still SIMD. Just... better at it. I think AMD and Intel GPUs can keep up, btw. But the software advantage and the long-term benefits of rewriting in CUDA are readily apparent.

Intel ISPC is a great project, btw, if you need high-level code that targets SSE, AVX, AVX512, and even ARM NEON, all from one codebase, with automatic compilation across all the architectures.

Intel's AVX512 is pretty good at the hardware level. But a software methodology for interacting with SIMD through GPU-like languages should be a priority. Intrinsics are fine for maximum performance, but they are too hard for mainstream programmers.
Joker_vD · 6 months ago

> SIMD instructions are complex, and even Arm is starting to look more "CISCy" than x86!

Thank you for saying it out loud. XLAT/XLATB on x86 is positively tame compared to, e.g., vrgatherei16.vv/vrgather.vv.
TinkersW · 6 months ago

You can simplify the 2x sqrts into sqrt(a*b): fewer operations overall, so perhaps more accurate. It would also let you get rid of the funky lane swizzles.

As this would only use one lane, perhaps if you have multiple of these to normalize, you could vectorize across them.
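The identity behind this suggestion, sqrt(a)*sqrt(b) = sqrt(a*b) for non-negative a and b, is easy to check numerically on a cosine-similarity kernel. A minimal NumPy sketch (function names are mine, not from the article):

```python
import numpy as np

def cosine_two_sqrts(x, y):
    # Two square roots: dot / (sqrt(|x|^2) * sqrt(|y|^2))
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

def cosine_one_sqrt(x, y):
    # One square root: dot / sqrt(|x|^2 * |y|^2)
    return np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y))

rng = np.random.default_rng(0)
x = rng.standard_normal(1536)
y = rng.standard_normal(1536)
assert np.isclose(cosine_two_sqrts(x, y), cosine_one_sqrt(x, y))
```

Besides saving a sqrt, the fused form replaces two independent square roots with one multiply and one sqrt, which removes a cross-lane shuffle in a SIMD implementation where the two squared norms land in different lanes.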
EVa5I7bHFq9mnYK · 6 months ago
C# vectors do a great job of simplifying those intrinsics in a safe and portable manner.
marmaduke · 6 months ago

My approach to this is to write a bunch of tiny "kernels" which are obviously SIMD-able and then inline them all; it does a pretty good job on x86 and ARM:

https://github.com/maedoc/tvbk/blob/nb-again/src/util.h
kristianp · 6 months ago

> Let's explore these challenges and how Mojo helps address them

You've not linked to or explained what Mojo is. There's also a lot going on, with different products mentioned (Modular, Unum Cloud, SimSIMD) that are not contextualised either. While I'm at it, where do the others come in (Ovadia, Lemire, Lattner)? You all worked on SimSIMD, I guess?

That said, this is a great article, thanks.

Edit: Mojo is a programming language with Python-like syntax, and is a product by Modular: https://github.com/modularml/mojo
juancn · 6 months ago

The main problem is that there are no good abstractions in popular programming languages for taking advantage of SIMD extensions.

Also, the feature set being all over the place (e.g. integer support is fairly recent) doesn't help either.

ISPC is a good idea, but the execution is meh... it's hard to set up and integrate.

Ideally you would want to be able to easily use this from other popular languages, like Java, Python, or JavaScript, without having to resort to linking a library written in C/C++.

Granted, language extensions may be required to approach something like that in an ergonomic way, but most somehow end up just mimicking what C++ does and exposing a pseudo-assembler.
rishi_devan · 6 months ago

Interesting article. The article mentions "...the NumPy implementation illustrates a marked improvement over the naive algorithm...", but I couldn't find a NumPy implementation in the article.
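The article's NumPy code isn't shown, but for comparison purposes a vectorized cosine-distance implementation typically looks like the following. This is a sketch of the usual naive-loop-vs-NumPy pairing, not the article's actual code; the function names are assumptions:

```python
import math
import numpy as np

def cosine_distance_naive(a, b):
    # Scalar Python loop: one multiply-accumulate per element.
    dot = aa = bb = 0.0
    for x, y in zip(a, b):
        dot += x * y
        aa += x * x
        bb += y * y
    return 1.0 - dot / math.sqrt(aa * bb)

def cosine_distance_numpy(a, b):
    # Vectorized: NumPy dispatches the dot products to SIMD-backed loops.
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

v = [1.0, 2.0, 3.0]
w = [3.0, 2.0, 1.0]
assert abs(cosine_distance_naive(v, w) - cosine_distance_numpy(v, w)) < 1e-12
```

The "marked improvement" claim would come from the loop crossing the interpreter once per element, while the NumPy version makes three library calls regardless of vector length.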
Agingcoder · 6 months ago

This is the first time I've heard 'hyperscalar'. Is this generally accepted? (I've been using SIMD since the MMX days, so am a bit surprised.)
remram · 6 months ago

Did they write bfloat16 and bfloat32 when they meant float16 and float32?

On the image: https://www.modular.com/blog/understanding-simd-infinite-complexity-of-trivial-problems#:~:text=bfloat16%20compared%20to%20bfloat32
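(For context on why "bfloat32" reads as a typo: bfloat16 is a specific format, a float32 with the low 16 mantissa bits dropped, so it keeps float32's 8-bit exponent range but only 7 mantissa bits, whereas IEEE float16 keeps 10 mantissa bits but only a 5-bit exponent. There is no "bfloat32". A small sketch of the truncation, with a helper name of my own choosing:)

```python
import numpy as np

def to_bfloat16(x):
    # bfloat16 is float32 with the low 16 mantissa bits zeroed (truncated).
    bits = np.float32(x).view(np.uint32)
    return np.uint32(bits & 0xFFFF0000).view(np.float32)

# Exactly representable values survive the truncation unchanged.
assert float(to_bfloat16(1.0)) == 1.0

# float16's 5-bit exponent overflows at ~65504; bfloat16's range matches float32.
assert float(np.float16(1e6)) == float('inf')
assert float(to_bfloat16(1e6)) != float('inf')
```

So the distinction matters for the article's benchmarks: bfloat16 trades precision for float32-sized dynamic range, while float16 does roughly the opposite.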
big-chungus4 · 6 months ago

Can the authors please share the NumPy code too?
bob1029 · 6 months ago

I see a lot of "just use the GPU", and you'd often be right.

SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate that information.

For most use cases, the GPU will win the trade-off. But there is a reason you don't hear much about systems like order-matching engines using them.
benchmarkist · 6 months ago

Looks like a great use case for AI: set up the logical specification and constraints, and let the AI find the optimal sequence of SIMD operations to fulfill the requirements.