I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned similar lessons: avoid global data, and amortize whatever synchronization you cannot avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive. There, stealing is more like <i>cutting in</i> on work that would be done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread.<p>The paper describing my design is "Low-Synchronization, Mostly Lock-Free,
Elastic Scheduling for Streaming Runtimes", <a href="https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduling.pdf" rel="nofollow">https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli...</a>. The product implementation is now open source. Most of it is in <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.h" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a> and <a href="https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SPL/Runtime/ProcessingElement/ScheduledQueue.cpp" rel="nofollow">https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP...</a>.
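<p>To make the "find an idle part of the graph instead of stealing" idea concrete, here is a minimal sketch. It is not the design from the paper or the code in ScheduledQueue; the names <code>GraphPart</code>, <code>claim_free_part</code>, and <code>release_part</code> are hypothetical, and it assumes each schedulable part of the graph carries its own atomic flag, so that no global lock or shared queue is needed to claim one.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one schedulable part of the streaming graph.
struct GraphPart {
    // True while some thread is executing this part.
    std::atomic<bool> busy{false};
};

// Scan the parts starting at `start` and claim the first one that no thread
// is executing. Returns the claimed index, or -1 if every part is busy.
// A thread never takes over a part that another thread already holds.
int claim_free_part(std::vector<GraphPart>& parts, int start) {
    const int n = static_cast<int>(parts.size());
    for (int i = 0; i < n; ++i) {
        const int idx = (start + i) % n;
        bool expected = false;
        // One compare-and-swap per candidate; the state is per part, not global.
        if (parts[idx].busy.compare_exchange_strong(expected, true,
                                                    std::memory_order_acquire)) {
            return idx;
        }
    }
    return -1;
}

// Give a part back so another thread can claim it later.
void release_part(std::vector<GraphPart>& parts, int idx) {
    assert(parts[idx].busy.load(std::memory_order_relaxed));
    parts[idx].busy.store(false, std::memory_order_release);
}
```

Giving each thread a different <code>start</code> offset spreads the scans out, so threads tend to touch different flags rather than contending on the same one.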