I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned similar lessons: avoid global data, and amortize whatever synchronization you can't avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive: stealing is more like <i>cutting in</i> when the work would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread.<p>The paper describing my design is "Low-Synchronization, Mostly Lock-Free,
Elastic Scheduling for Streaming Runtimes", <a href="https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduling.pdf" rel="nofollow">https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli...</a>. The source code for the product implementation is now open source. Most is in <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.h" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a> and <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.cpp" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a>.
Running the benchmarks locally (zig is fastest, but that's probably expected)<p>zig (~1 week older than HEAD):<p><pre><code> ./zig-out/bin/qsort
filling
shuffling
running
took 177.46ms
</code></pre>
rust nightly:<p><pre><code> cargo run --release --bin qsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/qsort`
filling
shuffling
running
took 896.91656ms
cargo run --release --bin rsort
warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
Finished release [optimized] target(s) in 0.02s
Running `target/release/rsort`
filling
shuffling
running
took 212.190694ms
</code></pre>
go 1.16.3:<p><pre><code> go run qsort.go
filling
shuffling
running
took 222.90356ms
</code></pre>
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9
To be very pedantic: the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, and not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent nodes enqueued afterwards by other producers from ever being dequeued by the consumer.<p>[1] edit: and in fact there is a note on exactly this point on the linked Vyukov page.
Incredible post. It ticks all the boxes, and it's exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Increasing the size of the array 10x (100_000_000) and filling a glaring omission:<p>Go (go1.17.1 darwin/amd64)<p><pre><code> took 5.591593544s
took 5.285948722s
took 5.218750076s
took 5.239224787s
took 5.131232207s
</code></pre>
Zig (0.9.0-dev.959+f011f1393)<p><pre><code> took 3.48s
took 3.37s
took 3.44s
took 3.43s
took 3.49s
</code></pre>
Java (openjdk version "17-ea")<p><pre><code> Time: 2684 ms
Time: 2654 ms
Time: 2840 ms
Time: 2676 ms
Time: 2649 ms
</code></pre>
MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3
It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take <i>seconds</i> instead of 100ms or less.<p>I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
This is the first thread algorithm I've seen built around a four-state machine.<p>And I like what I am seeing. Hopefully all transitions between states have been tested.<p>Good work.
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.<p>Put another way, during a period with more pushes than pops, some tasks may see a very long delay before being worked on?
A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most such algorithms lockless. There are still the same contention concerns when using atomic operations as with regular mutexes, and a mutex lock in the uncontended case never enters the kernel and is just an atomic operation... The main strategy, whether you use locks or atomics directly, is to avoid contention.<p>The only truly lockless algorithm that I'm aware of (for most CPU archs) is RCU in the read case.
Is there an IoC / CSP library (à la core.async in Clojure) in Zig yet? Is something like this planned? I could not find much about it online.