I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned similar lessons: avoid global data, and amortize whatever synchronization you can't avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive: stealing is more like <i>cutting in</i> when the work would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread.<p>The paper describing my design is "Low-Synchronization, Mostly Lock-Free,
Elastic Scheduling for Streaming Runtimes", <a href="https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduling.pdf" rel="nofollow">https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli...</a>. The source code for the product implementation is now open source. Most is in <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.h" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a> and <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.cpp" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a>.
Running the benchmarks locally (zig is fastest, but that's probably expected)<p>zig (~1 week older than HEAD):<p><pre><code> ./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
</code></pre>
rust nightly:<p><pre><code> cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
</code></pre>
go 1.16.3:<p><pre><code> go run qsort.go
filling
shuffling
running
took 222.90356ms
</code></pre>
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
To be very pedantic: the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, and not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent nodes enqueued afterwards by other producers from ever being dequeued by the consumer.<p>[1] edit: and in fact there is a note on exactly this point on the linked Vyukov page.
Incredible post. It ticks all the boxes, and it's exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Increasing the size of the array 10x (100_000_000) and filling a glaring omission:<p>Go (go1.17.1 darwin/amd64)<p><pre><code> took 5.591593544s
took 5.285948722s
took 5.218750076s
took 5.239224787s
took 5.131232207s
</code></pre>
Zig (0.9.0-dev.959+f011f1393)<p><pre><code> took 3.48s
took 3.37s
took 3.44s
took 3.43s
took 3.49s
</code></pre>
Java (openjdk version "17-ea")<p><pre><code> Time: 2684 ms
Time: 2654 ms
Time: 2840 ms
Time: 2676 ms
Time: 2649 ms
</code></pre>
MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take <i>seconds</i> instead of 100ms or less.<p>I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
This is the first thread algorithm I've seen built around a four-state machine.<p>And I like what I am seeing. Hopefully all transitions between states have been tested.<p>Good work.
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.<p>Put another way, during a period with more pushes than pops, some tasks may see a very long delay before being worked on?
A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most such algorithms lockless. There are still the same contention concerns when using atomic operations as with regular mutexes, and a mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, whether you use locks or atomics directly, is to avoid contention.<p>The only truly lockless algorithm that I'm aware of (for most CPU archs) is RCU in the read case.
Is there an IoC / CSP library (à la core.async in Clojure) in Zig yet? Is something like this planned? I could not find much about it online.