Threads become very expensive if you start throwing C++ exceptions within them in parallel: the overall time to join the threads increases with each thread you add. There is a mutex in the unwinding code, and as the threads grab the mutex they invalidate each other's cache lines. I wrote a demo to illustrate the problem: <a href="https://github.com/clasp-developers/ctak" rel="nofollow">https://github.com/clasp-developers/ctak</a><p>macOS doesn't have this problem, but Linux and FreeBSD do.
I find Eli Bendersky’s writeup [1] more useful, as it goes deeper into the details. For readers less familiar with the topic, it also makes clearer what the time spent depends on (how much state there is to copy). Eli’s post is actually a companion to his “cost of context switching” post [2], which is applicable more often (and helps answer all the questions below about thread pools).<p>[1] <a href="https://eli.thegreenplace.net/2018/launching-linux-threads-and-processes-with-clone/" rel="nofollow">https://eli.thegreenplace.net/2018/launching-linux-threads-a...</a><p>[2] <a href="https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/" rel="nofollow">https://eli.thegreenplace.net/2018/measuring-context-switchi...</a>
For CPU-bound tasks, it is best to pre-create a number of threads roughly equal to the number of logical execution cores, rather than spawning threads on demand. Every thread is then a worker with a main loop. Pin each worker's affinity to a specific core and you are as close as possible to the “perfect” arrangement: context switches are minimized and core-local cache data stays warm most of the time.
Great reminder.<p>Even with pre-created threads (a thread pool), when the task is small enough (less than roughly 1,000 cycles), it is cheaper to run it in place (or with fibers) than to hand it off, because of the cost of the context switch.
On any architecture, you may need to reduce the amount of stack space allocated for each thread to avoid running out of virtual memory.<p><a href="http://www.kegel.com/c10k.html#limits.threads" rel="nofollow">http://www.kegel.com/c10k.html#limits.threads</a>
Why is there such a big difference in timing between Skylake and Rome? Something compiler-specific? The number of steps required to create a thread should be identical.<p>I’d also be interested to see the same benchmark using pthread_create directly.
Why the relatively high cost of threads on ARM? If anything, I'd imagine it is more geared toward "massively parallel" scenarios (i.e., dozens of cores).
My personal best practice is to always create a thread pool on program startup and distribute tasks among its threads. I follow the same practice in all other languages too. Is this sound, or can it lead to problems in some corner cases?