I helped make a really cursed RISC-V version of this for a class project last year! The idea was to first compile each program to WASM using clang, and lower the WASM back to C but this time with all opcodes implemented in terms of the RISC-V vector intrinsics. That was a hack to be sure, but a surprisingly elegant one since
1. WASM's structured control flow maps really well to lane masking
2. Stack and local values easily use "structure of arrays" layout
3. Heap values easily use "array of structures" layout<p>It never went anywhere but the code is still online if anyone wants to stare directly at the madness: <a href="https://gitlab.com/samsartor/wasm2simt" rel="nofollow noreferrer">https://gitlab.com/samsartor/wasm2simt</a>
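To make points 1 and 2 concrete, here is a minimal pure-Python sketch of how an if/else can be run across all lanes with a mask, with the local stored one slot per lane. It is not taken from wasm2simt and uses no RISC-V intrinsics; `select` and `collatz_step_simt` are hypothetical names made up for illustration.

```python
def select(mask, new, old):
    """Per-lane blend: take `new` where the lane is active, otherwise keep `old`."""
    return [n if m else o for m, n, o in zip(mask, new, old)]


def collatz_step_simt(x):
    """All lanes run: if x % 2 == 0: x = x // 2  else: x = 3 * x + 1."""
    # Structure of arrays: the local `x` is one list with one slot per lane.
    then_mask = [v % 2 == 0 for v in x]      # lanes that enter the `if` arm
    else_mask = [not m for m in then_mask]   # lanes that enter the `else` arm

    # Both arms are evaluated for every lane; the mask decides which lanes commit.
    x = select(then_mask, [v // 2 for v in x], x)
    # Else-lanes were not touched above, so they still hold their original values.
    x = select(else_mask, [3 * v + 1 for v in x], x)
    return x


print(collatz_step_simt([6, 7, 8, 1]))  # [3, 22, 4, 4]
```

Because the control flow is structured, the mask for a nested block is just the enclosing mask ANDed with the new condition, which is presumably why the mapping from WASM felt natural.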
In addition to ISPC, some of this is also done in software fallback implementations of GPU APIs. In the open source world we have SwiftShader and Lavapipe, and on Windows we have WARP[1].<p>It's sad to me that Larrabee didn't catch on, as that might have been a path to a good parallel computer, one that has efficient parallel throughput like a GPU but also agility more like a CPU, so you don't need to batch things into huge dispatches and wait RPC-like latencies for them to complete. Apparently the main thing that sank it was power consumption.<p>[1]: <a href="https://learn.microsoft.com/en-us/windows/win32/direct3darticles/directx-warp" rel="nofollow noreferrer">https://learn.microsoft.com/en-us/windows/win32/direct3darti...</a>
Matt Pharr’s series of blog posts on ISPC is worth reading:
<a href="https://pharr.org/matt/blog/2018/04/30/ispc-all" rel="nofollow noreferrer">https://pharr.org/matt/blog/2018/04/30/ispc-all</a>
A colleague's Ph.D. thesis was on how to achieve high-performance CPU implementations of bulk-synchronous programming models ("GPU programming"):<p><a href="http://impact.crhc.illinois.edu/shared/Thesis/dissertation-hee-seok_kim.pdf" rel="nofollow noreferrer">http://impact.crhc.illinois.edu/shared/Thesis/dissertation-h...</a>
This so-called GPU programming model existed for decades before the first GPUs appeared, but at that time the compilers were not as good as the CUDA compilers, so the burden on the programmer was greater.<p>As another poster has already mentioned, there is a compiler for CPUs that was inspired by CUDA and has been available for many years: ISPC (Implicit SPMD Program Compiler), at <a href="https://github.com/ispc/ispc">https://github.com/ispc/ispc</a>.<p>NVIDIA has the very annoying habit of using a lot of terms that differ from those that had been used in computer science for decades. Worse, NVIDIA has not invented new words but has frequently reused words that were already widely used with other meanings.<p>SIMT (Single Instruction, Multiple Threads) is not the worst term coined by NVIDIA, but there was no need for yet another acronym. For instance, they could have used SPMD (Single Program, Multiple Data), which dates from 1988, two decades before CUDA.<p>Moreover, SIMT is the same thing that was called "array of processes" by C.A.R. Hoare in August 1978 (in "Communicating Sequential Processes"), "replicated parallel" by Occam in 1985, "PARALLEL DO" by "OpenMP Fortran" in 1997-10, and "parallel for" by "OpenMP C and C++" in 1998-10.<p>Each so-called CUDA kernel is just the body of a "parallel for" (which is multi-dimensional, like in Fortran).<p>The only (but extremely important) innovation brought by CUDA is that the compiler is smart enough that the programmer does not need to know the structure of the processor, i.e. how many cores it has and how many SIMD lanes each core has.
The CUDA compiler automatically distributes the work over the available SIMD lanes and cores, and in most cases the programmer does not care whether two executions of the per-item function run on two different cores or on two different SIMD lanes of the same core.<p>This distribution of work over SIMD lanes and cores is simple when the SIMD operations are maskable, as in GPUs, AVX-512 and its successor AVX10, or ARM SVE. When masking is not available, as in AVX2 or Armv8-A, the implementation of conditional statements and expressions is more complicated.
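As a rough illustration of "a kernel is just the body of a parallel for", here is a small Python sketch. The launcher is a sequential emulation with made-up names (`saxpy_kernel`, `launch`, and the `cores`/`lanes` parameters are all assumptions for this example, not any real CUDA or ISPC API); it only shows that the kernel is written for a single index and never sees how the indices are placed.

```python
def saxpy_kernel(i, a, x, y, out):
    """The 'kernel': the body of a parallel for, written for a single index i."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args, cores=2, lanes=4):
    """Hypothetical launcher: spreads range(n) over `cores` x `lanes`.

    Runs sequentially here; returns (core, lane, index) placements for inspection.
    """
    placement = []
    for start in range(0, n, lanes):
        core = (start // lanes) % cores      # round-robin lane groups over cores
        for lane in range(lanes):
            i = start + lane
            if i < n:                        # tail lanes past n are masked off
                kernel(i, *args)
                placement.append((core, lane, i))
    return placement


x = [1, 2, 3, 4, 5]
y = [10, 10, 10, 10, 10]
out = [0] * 5
where = launch(saxpy_kernel, 5, 2, x, y, out)
print(out)        # [12, 14, 16, 18, 20]
print(where[-1])  # (1, 0, 4): index 4 landed on lane 0 of core 1
```

Changing `cores` or `lanes` changes the placements but not `out`, which is the property described above: the per-item function does not care whether two of its executions share a core.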
> This is in contrast to SIMD, or "single instruction multiple data," where the programmer explicitly uses vector types and operations in their program. The SIMD approach is suited for when you have a single program that has to process a lot of data, whereas SIMT is suited for when you have many programs and each one operates on its own data<p>This statement compares the SIMT model to SIMD. Can anyone explain the last part, about SIMT being better suited to many programs that each operate on their own data? Are they just saying you can have individual “threads” executing independently (via predication/masks and such)?
Seems to be the same concept as in <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html" rel="nofollow noreferrer">https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_e...</a>, cool!
Hey, AVX-512 again!<p>"Show HN: SimSIMD vs SciPy: How AVX-512 and SVE make SIMD nicer and ML 10x faster" (2023-10)
<a href="https://news.ycombinator.com/item?id=37805810">https://news.ycombinator.com/item?id=37805810</a>