> In particular, our implementation does not utilize OpenMP or any other high-level multithreading APIs, and thus gives the programmer precise low-level control over the details of the parallelization, which permits more robust optimizations.<p>I totally disagree with this line of thinking. The person who knows most about the hardware on which the program will be run is the person running it, not the programmer. The OpenMP API is somewhat complicated in an attempt to allow the programmer to, at a high level, express ideas about data locality to the runtime.<p>Unless we're imagining a universe in which the programmer is around to tweak their source code for every hardware platform, the idea of "giving the programmer more <i>control</i>" is a dead end. The programmer must be given expressiveness.<p>Threading libraries are complicated because hardware is complicated. I mean, first-generation Threadrippers are a little old, but they still exist: do we really want to have everybody rewrite their code to handle "I have multiple packages on this node, NUMA between the dies in a given package, NUMA between the packages in the node!"
The paper has only a single benchmark, reported from a single system, where they report<p>> In our performance test, we see a speedup of 18.2x, saturating and even surpassing this estimated theoretical upper bound.<p>IMHO if you exceed your theoretical bound, that's a sign you didn't do a good job analyzing the situation.
Yet another threading implementation which ignores the Actor programming model. This must be a systemic failure in both academia, for not teaching students about Actors, and the industry, for barely talking about them. It’s like the old saying: those that don’t understand Unix are doomed to reimplement it, poorly.<p>For those readers that haven’t encountered Actors, it’s more than just a threading model. It allows messaging (important for interaction). An Actor is essentially a class with a std::vector<std::function<void()>> member, and a public messaging function which pushes the message, with variable arguments, onto the queue. The message invokes the scheduler, which sequentially invokes the queued functions. It eliminates locks for the client, which is the key benefit. It keeps Actors on a specific thread to use hot caches, unless work stealing moves the Actor to a free thread. The Actor abstraction is higher level than threads.<p>I’m sure the author will come to realise the benefits of Actors once he has a decade more experience under his belt, after which, just like me, he’ll be disappointed at academia and the industry for not talking about the benefits of the Actor programming model.<p>Shameless plug: my own Actor implementation, used in real-world 24h/365 embedded projects:
<a href="https://github.com/smallstepforman/Medo/tree/main/Actor" rel="nofollow">https://github.com/smallstepforman/Medo/tree/main/Actor</a>
Looking at the code, it seems like it is more of a class project than a highly optimized library.<p>It uses mutexes, condition variables, and futures to write a pretty much textbook implementation of a thread pool.<p>However, there will be significant contention, as all the workers are reading from the same queue, and the submissions are going to the same queue.<p>There are no multiple queues with work stealing, which I think is a minimum for a production-level version of something like this.<p>EDIT:<p>IIRC, C++ Concurrency in Action by Anthony Williams has a nice discussion of the issues of building a C++ thread pool using the C++ standard library. It does address things like contention and work stealing.
Code: <a href="https://github.com/bshoshany/thread-pool/blob/master/BS_thread_pool.hpp#L202" rel="nofollow">https://github.com/bshoshany/thread-pool/blob/master/BS_thre...</a><p><a href="https://github.com/bshoshany/thread-pool/blob/master/BS_thread_pool.hpp#L334" rel="nofollow">https://github.com/bshoshany/thread-pool/blob/master/BS_thre...</a><p>I don't see how this is built for particularly good performance, or with many-core machines in mind.
Writing a thread pool which is performant and scalable is surprisingly hard; I've done it a couple of times. And now, whenever I can, I tend to avoid doing that. Instead, I prefer using the stuff already implemented by someone else, in OS userlands or runtime libraries.<p>• C and C++ runtime libraries have OpenMP. The level of support varies across compilers, but all mainstream ones support at least OpenMP 2.0.<p>• The .NET runtime comes with its own thread pool. The standard library provides both low-level APIs similar to Windows’ built-in one, and higher-level things like Parallel.For, remotely comparable to OpenMP.<p>• The Windows userland has another one; see the CreateThreadpoolWork and SubmitThreadpoolWork APIs introduced in Vista. The API is totally different though: a programmer needs to manually submit each task to the pool.<p>• The iOS and OSX userlands also have a thread pool, called Grand Central Dispatch. It’s very comparable to the Windows version.
This is a very primitive implementation. I like it, and it looks good, but using mutexes and condition variables will be comparatively slow. You really want to, in theory, abuse atomics in order to "simulate" condition variables, since you avoid the possible context switch into the kernel which condition variables may incur (afaik).<p>Still cool, but I know (from building and testing my own similar implementations) that a good bit of performance will be lost here due to these context switches. For an example of a more complex, but much more efficient, thread pool implementation, check out this article on thread pools in Zig [1] (the language is secondary; it's mostly pseudocode).<p>It goes over the need for an actual scheduler to make thread pools efficient.<p>1: <a href="https://zig.news/kprotty/resource-efficient-thread-pools-with-zig-3291" rel="nofollow">https://zig.news/kprotty/resource-efficient-thread-pools-wit...</a>
In my experience, Eigen's thread pool is decent. But OpenMP (edit: I mean Intel's implementation, donated to LLVM) is often faster, especially if threads are allowed to be affinitized to HW processors. Another promise of OpenMP not made by various thread pools is cooperative execution; in thread pools, tasks are usually independent.<p>However, if any part of an app uses affinitized threads, the whole app needs to be using the same thread pool, as otherwise perf will go down. In this regard, OpenMP is less composable.
It’s been a long time since I touched C++, so pardon my naïveté. I’d have assumed that optimized thread pools were a done thing. What’s new here, and why was there a gap?
Perhaps slightly off topic, but playing with parallel numerical computations I found that C++17/20's parallel algorithms (the "parallel for" execution policies) are significantly faster than using manually created threads.<p><a href="https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag" rel="nofollow">https://en.cppreference.com/w/cpp/algorithm/execution_policy...</a>