Here is an example:<p>NN-512 uses pthreads. It generates a tiny, standalone, highly scalable C99 work-stealing thread pool in every C file. The work items are simply coordinates, similar to the threadIdx/blockIdx coordinates in Nvidia's CUDA. For example, see all the structs and functions whose names start with "Example32Threader" in this file:<p><a href="https://nn-512.com/example/32#3" rel="nofollow">https://nn-512.com/example/32#3</a><p>The implementation ends at Example32ThreaderDo1.<p>The key is to avoid writing to shared cache lines whenever possible, and to keep a bitmap in the central struct (the "hub") that records which threads have no remaining work to steal, so that other threads don't keep rediscovering that fact (this avoids a "thundering herd" type of problem: <a href="https://en.m.wikipedia.org/wiki/Thundering_herd_problem" rel="nofollow">https://en.m.wikipedia.org/wiki/Thundering_herd_problem</a>).<p>Another key is to never take any shortcuts in synchronization. Rather than doing something subtle or clever to avoid mutexes (etc.), make sure the units of work are big enough that synchronization costs are negligible relative to the overall computation.
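<p>To make the idea concrete, here is a rough sketch of the same pattern. This is not the code NN-512 generates: every name in it (SketchHub, SketchNode, sketch_run, and so on) is made up for illustration, the work items are plain 1-D indices rather than multi-dimensional coordinates, and the threads are created per call instead of living in a persistent pool. It only tries to show the three points above: per-thread state padded onto its own cache lines, a hub bitmap that remembers which threads are drained, and plain mutexes everywhere.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

enum { SKETCH_MAX = 64 };  // one bitmap bit per thread
typedef struct SketchHub SketchHub;

typedef struct {
    pthread_mutex_t mut;  // guards next/end
    long next, end;       // remaining work items: [next, end)
    int id; SketchHub* hub;
    char pad[64];         // crude padding so nodes do not share a cache line
} SketchNode;

struct SketchHub {
    pthread_mutex_t mut;  // guards the bitmap
    uint64_t stealable;   // bit i set: node i may still hold work
    int nthreads;
    void (*task)(void* ctx, long item); void* ctx;
    SketchNode nodes[SKETCH_MAX];
};

// Caller holds node id's mutex (lock order is always node, then hub).
static void sketch_mark(SketchHub* hub, int id, int has_work) {
    pthread_mutex_lock(&hub->mut);
    if (has_work) hub->stealable |= (uint64_t)1 << id;
    else hub->stealable &= ~((uint64_t)1 << id);
    pthread_mutex_unlock(&hub->mut);
}

// Pop one item from our own range; clear our bit once the range is empty.
static int sketch_take(SketchNode* n, long* item) {
    pthread_mutex_lock(&n->mut);
    int ok = n->next < n->end;
    if (ok) *item = n->next++;
    else sketch_mark(n->hub, n->id, 0);
    pthread_mutex_unlock(&n->mut);
    return ok;
}

// Move the upper half of a victim's range into ours; 0 means nothing is left.
static int sketch_steal(SketchNode* self) {
    SketchHub* hub = self->hub;
    for (;;) {
        pthread_mutex_lock(&hub->mut);
        uint64_t bits = hub->stealable;
        pthread_mutex_unlock(&hub->mut);
        if (!bits) return 0;
        int v = self->id;  // scan from our neighbor so thieves spread out
        do v = (v + 1) % hub->nthreads; while (!((bits >> v) & 1));
        SketchNode* victim = &hub->nodes[v];
        pthread_mutex_lock(&victim->mut);
        long grab = (victim->end - victim->next + 1) / 2;
        long lo = (victim->end -= grab);
        if (!grab) sketch_mark(hub, v, 0);  // drained: nobody rechecks this node
        pthread_mutex_unlock(&victim->mut);
        if (!grab) continue;
        pthread_mutex_lock(&self->mut);
        self->next = lo; self->end = lo + grab;
        sketch_mark(hub, self->id, 1);
        pthread_mutex_unlock(&self->mut);
        return 1;
    }
}

static void* sketch_worker(void* arg) {
    SketchNode* self = arg;
    long item;
    do {
        while (sketch_take(self, &item)) self->hub->task(self->hub->ctx, item);
    } while (sketch_steal(self));
    return 0;
}

// Run task(ctx, i) once for each i in [0, items) on nthreads threads.
void sketch_run(int nthreads, long items, void (*task)(void*, long), void* ctx) {
    assert(nthreads >= 1 && nthreads <= SKETCH_MAX);
    SketchHub hub = {.nthreads = nthreads, .task = task, .ctx = ctx};
    pthread_t thr[SKETCH_MAX];
    pthread_mutex_init(&hub.mut, 0);
    for (int i = 0; i < nthreads; ++i) {
        SketchNode* n = &hub.nodes[i];
        pthread_mutex_init(&n->mut, 0);
        n->id = i; n->hub = &hub;
        n->next = items * i / nthreads; n->end = items * (i + 1) / nthreads;
        hub.stealable |= (uint64_t)1 << i;
    }
    for (int i = 0; i < nthreads; ++i)
        if (pthread_create(&thr[i], 0, sketch_worker, &hub.nodes[i])) abort();
    for (int i = 0; i < nthreads; ++i) pthread_join(thr[i], 0);
    for (int i = 0; i < nthreads; ++i) pthread_mutex_destroy(&hub.nodes[i].mut);
    pthread_mutex_destroy(&hub.mut);
}

// Example task: each item bumps its own byte, so items never share a slot.
void sketch_count(void* ctx, long item) { ((unsigned char*)ctx)[item] += 1; }
```

<p>Calling sketch_run(4, 1000, sketch_count, hits) on a zeroed 1000-byte array should leave every byte at 1 (build with -pthread). Note that every item costs a mutex lock and unlock here, which is deliberate: the sketch takes no shortcuts, so it only pays off when each item is a substantial chunk of computation.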