OpenMP is one of the easiest ways to make existing code run across CPU cores. In the simplest cases you add a single #pragma to C code and it goes N times faster: this works when you're running a function in a loop with no side effects. Some examples I've done:<p>1) Ray tracing. Looping over all the pixels in an image, using ray tracing to determine the color of each pixel. The algorithm and data structures are complex but don't change during the rendering. N cores is about N times as fast.<p>2) In SolveSpace we had a small loop that calls a tessellation function on a bunch of NURBS surfaces. The function was appending triangles to a list, so I made a thread-local list for each call and combined them afterwards to avoid writes to a shared data structure; a sketch of that pattern is below. Again N times faster with very little effort.<p>The code also builds fine single-threaded without change if you don't have OpenMP: your compiler will just ignore the #pragmas.
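A minimal sketch of that second pattern (the Triangle/TriList types and function names are invented for illustration, not SolveSpace's actual code). Each iteration writes only to its own list, so the parallel region touches no shared state, and the merge happens serially afterwards. It also builds and runs correctly without -fopenmp, since the pragma is then ignored:<p><pre><code>#include &lt;stdlib.h&gt;

/* Hypothetical types standing in for the real mesh data structures. */
typedef struct { float v[9]; } Triangle;
typedef struct { Triangle *tris; int count, cap; } TriList;

static void tri_list_append(TriList *l, Triangle t) {
    if (l-&gt;count == l-&gt;cap) {
        l-&gt;cap = l-&gt;cap ? 2 * l-&gt;cap : 64;
        l-&gt;tris = (Triangle *) realloc(l-&gt;tris, l-&gt;cap * sizeof(Triangle));
    }
    l-&gt;tris[l-&gt;count++] = t;
}

/* Stand-in for the real tessellator: appends one triangle for surface i. */
static void tessellate_surface(int i, TriList *out) {
    Triangle t = {{ (float) i, 0, 0, 0, 1, 0, 0, 0, 1 }};
    tri_list_append(out, t);
}

void tessellate_all(int nsurfaces, TriList *combined) {
    TriList *local = (TriList *) calloc(nsurfaces, sizeof(TriList));

    /* Each call appends only to its own private list, so there are
       no writes to shared data inside the parallel region. */
    #pragma omp parallel for
    for (int i = 0; i &lt; nsurfaces; i++) {
        tessellate_surface(i, &amp;local[i]);
    }

    /* Serial merge after the parallel loop. */
    for (int i = 0; i &lt; nsurfaces; i++) {
        for (int j = 0; j &lt; local[i].count; j++)
            tri_list_append(combined, local[i].tris[j]);
        free(local[i].tris);
    }
    free(local);
}</code></pre>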
You can now (as of OpenMP 5.x) use it to write GPU programs. Intel's oneAPI uses OpenMP 5.x to write programs for the Intel Ponte Vecchio GPUs, which are on par with the NVIDIA A100.<p><a href="https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2023-0/compiling-and-running-an-openmp-application.html" rel="nofollow">https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...</a><p>GCC also provides offloading support for NVIDIA and AMD GPUs:<p><a href="https://gcc.gnu.org/wiki/Offloading" rel="nofollow">https://gcc.gnu.org/wiki/Offloading</a><p>Here is an example of how you can use OpenMP to run a kernel on an NVIDIA A100:<p><a href="https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/gpu/compileandrun/" rel="nofollow">https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...</a><p><pre><code>#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;
#include &lt;omp.h&gt;

void saxpy(int n, float a, float *x, float *y) {
    double elapsed = -1.0 * omp_get_wtime();
    // No need to map the variable a: scalars are firstprivate by default.
    #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n])
    for (int i = 0; i &lt; n; i++) {
        y[i] = a * x[i] + y[i];
    }
    elapsed += omp_get_wtime();
    printf("saxpy done in %6.3lf seconds.\n", elapsed);
}

int main() {
    int n = 2000000;
    float *x = (float*) malloc(n * sizeof(float));
    float *y = (float*) malloc(n * sizeof(float));
    float alpha = 2.0f;

    #pragma omp parallel for
    for (int i = 0; i &lt; n; i++) {
        x[i] = 1;
        y[i] = i;
    }

    saxpy(n, alpha, x, y);
    free(x);
    free(y);
    return 0;
}</code></pre>
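How you compile this depends on the toolchain, so treat these as starting points rather than gospel: with a GCC built for nvptx offloading, something like "gcc -fopenmp -foffload=nvptx-none saxpy.c" should offload the target region, and with Clang the rough equivalent is "-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda". Note that stock distro compilers are often built without GPU offloading support, in which case the target region silently falls back to the host.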
I used it a while ago, but got burned by very uneven support across compilers: MSVC required special tweaks, and old GCC would generate crashy code without warning.<p>It was okay for basic embarrassingly parallel for loops. I ended up not using any of the more advanced features, because apart from even worse compiler support, non-trivial multi-threading in C without any safeguards is just too easy to mess up.
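For a concrete example of the kind of tweak involved (a sketch, not the exact code in question): MSVC's classic /openmp mode only implements OpenMP 2.0, which requires a signed loop index, and calls into the omp_* runtime need guarding if the code must also build without OpenMP:<p><pre><code>#include &lt;stdio.h&gt;
#ifdef _OPENMP
#include &lt;omp.h&gt;
#endif

int main(void) {
    /* Use int (or ptrdiff_t), not size_t: OpenMP 2.0 as implemented
       by MSVC's /openmp requires a signed loop counter. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i &lt; 1000000; i++) {
        sum += 1.0 / (i + 1);
    }

    /* Guard omp_* calls so the file still builds without OpenMP. */
#ifdef _OPENMP
    printf("threads available: %d\n", omp_get_max_threads());
#endif
    printf("sum = %f\n", sum);
    return 0;
}</code></pre>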
I was just googling to see if there's any Emscripten/WASM implementation of OpenMP. The emscripten github issue [1] has a link to this "simpleomp" [2][3] where<p>> In ncnn project, we implement a minimal openmp runtime for webassembly target<p>> It only works for #pragma omp parallel for num_threads(N)<p>[1] <a href="https://github.com/emscripten-core/emscripten/issues/13892">https://github.com/emscripten-core/emscripten/issues/13892</a><p>[2] <a href="https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h">https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h</a><p>[3] <a href="https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cpp">https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...</a>
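For reference, the one construct simpleomp claims to handle would look like this (a minimal sketch of my own, not code from ncnn):<p><pre><code>#include &lt;stdio.h&gt;

int main(void) {
    int out[8];

    /* simpleomp reportedly supports only this exact form:
       a parallel for with an explicit num_threads clause. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i &lt; 8; i++) {
        out[i] = i * i;
    }

    for (int i = 0; i &lt; 8; i++) printf("%d ", out[i]);
    printf("\n");
    return 0;
}</code></pre>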