> In particular, our implementation does not utilize OpenMP or any other high-level multithreading APIs, and thus gives the programmer precise low-level control over the details of the parallelization, which permits more robust optimizations.<p>I totally disagree with this line of thinking. The person who knows most about the hardware on which the program will be run is the person running it, not the programmer. The OpenMP API is somewhat complicated in an attempt to allow the programmer to, at a high level, express ideas about data locality to the runtime.<p>Unless we're imagining a universe in which the programmer is around to tweak their source code for every hardware platform, the idea of "giving the programmer more <i>control</i>" is a dead end. The programmer must be given expressiveness.<p>Threading libraries are complicated because hardware is complicated. I mean, first-generation Threadrippers are a little old, but they still exist: do we really want to have everybody rewrite their code to handle "I have multiple packages on this node, NUMA between the dies in a given package, NUMA between the packages in the node!"
The paper has only a single benchmark, reported from a single system, where they report<p>> In our performance test, we see a speedup of 18.2x, saturating and even surpassing this estimated theoretical upper bound.<p>IMHO if you exceed your theoretical bound, that's a sign you didn't do a good job analyzing the situation.
Yet another threading implementation which ignores the Actor programming model. This must be a systemic failure in both academia, for not teaching students about Actors, and the industry, for barely talking about them. It’s like the old saying: those that don’t understand Unix are doomed to reimplement it, poorly.<p>For those readers that haven’t encountered Actors, it’s more than just a threading model. It allows messaging (important for interaction). An Actor is essentially a class with a std::vector<std::function<void()>> member, and a public messaging function which pushes the message, with variable arguments, onto the queue. The message invokes the scheduler, which sequentially invokes the queued functions. It eliminates locks for the client, which is the key benefit. It keeps Actors on a specific thread to use hot caches, unless work stealing moves the Actor to a free thread. The Actor abstraction is higher level than threads.<p>I’m sure the author will come to realise the benefits of Actors once he has a decade more experience under his belt, after which, just like me, he’ll be disappointed at academia and the industry for not talking about the benefits of the Actor programming model.<p>Shameless plug: my own Actor implementation, used in real-world 24h/365 embedded projects:
<a href="https://github.com/smallstepforman/Medo/tree/main/Actor" rel="nofollow">https://github.com/smallstepforman/Medo/tree/main/Actor</a>
Looking at the code, it seems like it is more of a class project than a highly optimized library.<p>It uses mutexes, condition variables, and futures to write a pretty much textbook implementation of a thread pool.<p>However, there will be significant contention, as all the workers are reading from the same queue, and the submissions are going to the same queue.<p>There are no multiple queues with work stealing, which I think is a minimum for a production-level version of something like this.<p>EDIT:<p>IIRC, C++ Concurrency in Action by Anthony Williams has a nice discussion of the issues of building a C++ thread pool using the C++ standard library. It does address things like contention and work stealing.
Code: <a href="https://github.com/bshoshany/thread-pool/blob/master/BS_thread_pool.hpp#L202" rel="nofollow">https://github.com/bshoshany/thread-pool/blob/master/BS_thre...</a><p><a href="https://github.com/bshoshany/thread-pool/blob/master/BS_thread_pool.hpp#L334" rel="nofollow">https://github.com/bshoshany/thread-pool/blob/master/BS_thre...</a><p>I don't see how this is built for particularly good performance, or with many-core machines in mind.
Writing a thread pool which is performant and scalable is surprisingly hard; I've done it a couple of times. And now, whenever I can, I tend to avoid doing that. Instead, I prefer using the stuff already implemented by someone else, in OS userlands or runtime libraries.<p>• C and C++ runtime libraries have OpenMP. The level of support varies across compilers, but all mainstream ones support at least OpenMP 2.0.<p>• The .NET runtime comes with its own thread pool. The standard library provides both low-level APIs similar to Windows’ built-in one, and higher-level things like Parallel.For, remotely comparable to OpenMP.<p>• The Windows userland has another one; see the CreateThreadpoolWork and SubmitThreadpoolWork APIs introduced in Vista. The API is totally different though: a programmer needs to manually submit each task to the pool.<p>• The iOS and OSX userlands also have a thread pool, called Grand Central Dispatch. It’s very comparable to the Windows version.
This is a very primitive implementation. I like it, and it looks good, but using mutexes and condition variables will be comparatively slow. You really want to, in theory, abuse atomics in order to "simulate" condition variables, since you avoid the possible context switch into the kernel which condition variables may incur (afaik).<p>Still cool, but I know (from building and testing my own similar implementations) that a good bit of performance will be lost here due to these context switches. For an example of a more complex, but much more efficient, thread pool implementation, check out this article on thread pools in Zig [1] (the language is secondary; it's mostly pseudocode).<p>It goes over the need for an actual scheduler to make thread pools efficient.<p>1: <a href="https://zig.news/kprotty/resource-efficient-thread-pools-with-zig-3291" rel="nofollow">https://zig.news/kprotty/resource-efficient-thread-pools-wit...</a>
In my experience, Eigen's thread pool is decent. But OpenMP (edit: I mean Intel's implementation, donated to LLVM) is often faster, especially if threads are allowed to be affinitized to HW processors. Another promise of OpenMP not made by various thread pools is cooperative execution; in thread pools, tasks are usually independent.<p>However, if any part of an app uses affinitized threads, the whole app needs to be using the same thread pool, as otherwise perf will go down. In this regard, OpenMP is less composable.
It’s been a long time since I touched C++, so pardon my naïveté. I’d have assumed that optimized thread pools were a done thing. What’s new here, and why was there a gap?
Perhaps slightly off topic, but playing with parallel numerical computations I found that C++17/20's parallel algorithms (the "parallel for" execution policies) are significantly faster than using manually created threads.<p><a href="https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag" rel="nofollow">https://en.cppreference.com/w/cpp/algorithm/execution_policy...</a>