
Implementing a GPU's programming model on a CPU

280 points by luu over 1 year ago

11 comments

samsartor over 1 year ago
I helped make a really cursed RISC-V version of this for a class project last year! The idea was to first compile each program to WASM using clang, and lower the WASM back to C but this time with all opcodes implemented in terms of the RISC-V vector intrinsics. That was a hack to be sure, but a surprisingly elegant one since:

1. WASM's structured control flow maps really well to lane masking
2. Stack and local values easily use "structure of arrays" layout
3. Heap values easily use "array of structures" layout

It never went anywhere but the code is still online if anyone wants to stare directly at the madness: https://gitlab.com/samsartor/wasm2simt
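A minimal sketch of what those three points might look like in the lowered C++, for a hypothetical 8-lane width; the names and the masked-add helper are illustrative, not code from wasm2simt:

    #include <cstdint>

    constexpr int LANES = 8; // hypothetical vector width

    // Point 2: a WASM local becomes a "structure of arrays" value,
    // one slot per lane.
    struct LocalI32 {
        int32_t v[LANES];
    };

    // Point 3: heap objects keep their ordinary "array of structures"
    // layout; each lane just follows its own index into the heap.
    struct Particle { float x, y, z; };

    // Point 1: structured control flow becomes lane masking. An i32.add
    // inside an `if` block only executes on lanes whose mask bit is set.
    void add_masked(LocalI32& dst, const LocalI32& a, const LocalI32& b,
                    const bool mask[LANES]) {
        for (int i = 0; i < LANES; ++i)
            if (mask[i]) dst.v[i] = a.v[i] + b.v[i];
    }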
raphlinus over 1 year ago
In addition to ISPC, some of this is also done in software fallback implementations of GPU APIs. In the open source world we have SwiftShader and Lavapipe, and on Windows we have WARP [1].

It's sad to me that Larrabee didn't catch on, as that might have been a path to a good parallel computer, one that has efficient parallel throughput like a GPU, but also agility more like a CPU, so you don't need to batch things into huge dispatches and wait RPC-like latencies for them to complete. Apparently the main thing that sank it was power consumption.

[1]: https://learn.microsoft.com/en-us/windows/win32/direct3darticles/directx-warp
bcatanzaro over 1 year ago
Matt Pharr's series of blog posts on ISPC is worth reading: https://pharr.org/matt/blog/2018/04/30/ispc-all
phdelightful over 1 year ago
A colleague's Ph.D. thesis was on how to achieve high-performance CPU implementations of bulk-synchronous programming models ("GPU programming"):

http://impact.crhc.illinois.edu/shared/Thesis/dissertation-hee-seok_kim.pdf
fulafel over 1 year ago
See also: https://ispc.github.io/
adrian_b over 1 year ago
This so-called GPU programming model existed many decades before the appearance of the first GPUs, but at that time the compilers were not as good as the CUDA compilers, so the burden on the programmer was greater.

As another poster has already mentioned, there is a CUDA-inspired compiler for CPUs that has been available for many years: ISPC (Implicit SPMD Program Compiler), at https://github.com/ispc/ispc .

NVIDIA has the very annoying habit of using a lot of terms that differ from those previously used in computer science for decades. The worst part is that NVIDIA has not invented new words, but has frequently reused words that were already widely used with other meanings.

SIMT (Single-Instruction Multiple Thread) is not the worst term coined by NVIDIA, but there was no need for yet another acronym. For instance they could have used SPMD (Single Program, Multiple Data), which dates from 1988, two decades before CUDA.

Moreover, SIMT is the same thing that was called "array of processes" by C.A.R. Hoare in August 1978 (in "Communicating Sequential Processes"), "replicated parallel" by Occam in 1985, "PARALLEL DO" by OpenMP Fortran in October 1997, and "parallel for" by OpenMP C and C++ in October 1998.

Each so-called CUDA kernel is just the body of a "parallel for" (which is multi-dimensional, like in Fortran).

The only (but extremely important) innovation brought by CUDA is that the compiler is smart enough that the programmer does not need to know the structure of the processor, i.e. how many cores it has and how many SIMD lanes each core has. The CUDA compiler automatically distributes the work over the available SIMD lanes and cores, and in most cases the programmer does not care whether two executions of the per-item function happen on two different cores or on two different SIMD lanes of the same core.

This distribution of work over SIMD lanes and cores is simple when the SIMD operations are maskable, as in GPUs, in AVX-512 (a.k.a. AVX10), or in ARM SVE. When masking is not available, as in AVX2 or Armv8-A, the implementation of conditional statements and expressions is more complicated.
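A minimal sketch of the "kernel is just the body of a parallel for" point, using a saxpy-style example that is illustrative rather than taken from the comment:

    // CUDA form: the body runs once per logical thread index i.
    //
    //   __global__ void saxpy(float a, const float* x, float* y, int n) {
    //       int i = blockIdx.x * blockDim.x + threadIdx.x;
    //       if (i < n) y[i] = a * x[i] + y[i];
    //   }

    // OpenMP form: the same body, with the index supplied by the loop.
    // The compiler/runtime decides how iterations map onto cores and
    // SIMD lanes, much as the CUDA compiler does.
    void saxpy(float a, const float* x, float* y, int n) {
        #pragma omp parallel for simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }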
TechnicolorByte over 1 year ago
> This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data

This statement is comparing the SIMT model to SIMD. Can anyone explain the last part, about SIMT being better for many programs each operating on their own data? Are they just saying you can have individual "threads" executing independently (via predication/masks and such)?
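One way to picture the contrast being asked about; both functions are illustrative sketches, and the SIMD version assumes n is a multiple of 8:

    #include <immintrin.h>

    // SIMD model: one program, vectors written explicitly by the
    // programmer (8-wide AVX here). Assumes n is a multiple of 8.
    void scale_simd(const float* x, float* y, int n, float a) {
        const __m256 va = _mm256_set1_ps(a);
        for (int i = 0; i < n; i += 8)
            _mm256_storeu_ps(y + i,
                             _mm256_mul_ps(va, _mm256_loadu_ps(x + i)));
    }

    // SIMT model: each "thread" is plain scalar code operating on its
    // own element. When many of these run in lockstep on SIMD lanes,
    // the divergent branch below is handled by predication/masks
    // rather than by the programmer.
    void scale_simt_thread(int tid, const float* x, float* y, float a) {
        if (x[tid] > 0.0f)
            y[tid] = a * x[tid];
        else
            y[tid] = 0.0f;
    }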
vient over 1 year ago
Seems to be the same concept as in https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html, cool!
westurner over 1 year ago
Hey, AVX-512 again!

"Show HN: SimSIMD vs SciPy: How AVX-512 and SVE make SIMD nicer and ML 10x faster" (2023-10): https://news.ycombinator.com/item?id=37805810
saagarjha over 1 year ago
(Is this available anywhere?)
amelius over 1 year ago
Isn't this already implemented in QEMU?