I wrote a parallel image processing tool in Swift that uses Grand Central Dispatch on macOS to process all the files in a directory.

I hit the problem that GCD would happily let me enqueue tasks for processing, but each task needed a significant amount of memory and GPU resources (because it used Core Image filters that support GPU acceleration "when available"), which made the computer hit swap and slow to a crawl once the available RAM was exhausted.

I had to create an admission control component that would only let me enqueue a task if there was enough memory left to process the image without risking swap. That part isn't as trivial as it seems: getting the amount of memory available to a process is a bit vague on UNIX/macOS/Linux, since many OSes of this kind assume you can allocate an effectively infinite amount of memory (see Linux's overcommit system) and that the virtual memory system will do the right thing for you. Also, nobody can predict whether physical RAM is about to be claimed by some other process, so I used a fairly conservative heuristic that probably left a small amount of resources unused most of the time, but would almost never slow to a crawl. Not perfect, but "good enough".
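(This isn't the original Swift/GCD code; it's a minimal sketch of the same admission-control idea in Java, using a semaphore whose permits stand in for megabytes of a fixed memory budget. The budget and per-image estimates would come from whatever conservative heuristic you trust.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Admission control: a task is only handed to the pool once its estimated
    // memory footprint fits inside a fixed budget, so we never over-commit RAM.
    class AdmissionControl {
        private final Semaphore memoryBudgetMb;
        private final ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        AdmissionControl(int budgetMb) {
            this.memoryBudgetMb = new Semaphore(budgetMb);
        }

        // Blocks the producer until enough of the budget is free.
        void submit(Runnable task, int estimatedMb) throws InterruptedException {
            memoryBudgetMb.acquire(estimatedMb);   // reserve before enqueueing
            pool.execute(() -> {
                try {
                    task.run();
                } finally {
                    memoryBudgetMb.release(estimatedMb);  // give the budget back
                }
            });
        }
    }

The nice side effect is backpressure: the producer blocks in submit() instead of letting the queue grow until the machine swaps.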
Node suffers from similar problems, although I would describe them differently from the author. Essentially:

1. All of Node's async I/O is lumped together into the same threadpool.

2. There is no distinction between the nature of each async I/O task.

3. Async CPU tasks (fs.stat hitting the fs cache, async multi-core crypto, async native addons) complete orders of magnitude faster than async disk tasks (SSD or HDD), and those can be orders of magnitude faster than async network tasks (DNS requests to a broken DNS server).

4. There are three basic async performance profiles: fast (CPU), slow (disk), and very slow (DNS), but Node has no concept of this.

5. This leads to the convoy effect. Imagine what happens when you race trucks, cars, and F1 cars all on the same race track.

6. The threadpool has a default size of only 4 threads, on the assumption that this reflects the typical number of CPU cores (and reduces context switches).

7. 4 threads is a bad default because it leads to surprising behavior (4 slow DNS requests to untrusted servers are enough to DoS the process).

8. 4 threads is a bad default because libuv's memory cost for even 128 threads is cheap.

9. 4 threads is a bad default because it prevents the CPU scheduler from running async CPU tasks while slow disk and slower DNS tasks are running. Concurrent CPU tasks should be limited to the number of cores available, while concurrent disk and DNS tasks should be allowed to exceed the number of cores (context switches are better amortized for these).

10. Because everything is conflated, hard concurrency limits can't be enforced on fast, slow, or slower tasks. It's all or nothing.

There are efforts underway to support multiple threadpools in Node: a threadpool for fast tasks sized to the number of cores, a threadpool for slow tasks sized larger, and a threadpool for slower tasks also sized larger. The goal is to get to the point where we have separate race tracks in Node, with F1 cars, cars, and trucks on separate tracks, controlled and raced to their full potential:

https://github.com/libuv/libuv/pull/1726
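(Node/libuv specifics aside, here's a rough Java sketch of what that separation looks like. The oversubscription factors are illustrative guesses, not what the libuv PR proposes.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Separate "race tracks": each class of work gets its own pool with its own
    // hard concurrency limit, so slow DNS lookups can never starve CPU-bound work.
    class Tracks {
        static final int CORES = Runtime.getRuntime().availableProcessors();

        // Fast, CPU-bound tasks: no point running more than one per core.
        static final ExecutorService cpuPool  = Executors.newFixedThreadPool(CORES);

        // Slow tasks (disk I/O): oversubscribe, the threads mostly wait.
        static final ExecutorService diskPool = Executors.newFixedThreadPool(4 * CORES);

        // Slower tasks (DNS / network): oversubscribe even more.
        static final ExecutorService dnsPool  = Executors.newFixedThreadPool(16 * CORES);
    }

With this shape, a stuck resolver only ties up dnsPool threads (Tracks.dnsPool.submit(...)), while cpuPool and diskPool keep running at full speed.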
The title is misleading. The author seems to mask what they're saying in a way that makes it easy to conclude their point is "Thread pools are bad, so I'll show you a better way." In fact, the proposed better way is ThreadPools(tm).

"The Unscalable, Deadlock-Prone, Thread Pool" really only refers to a badly used thread pool. Like, really badly: the strawman that forms the crux of the argument is using one and the same thread pool for all of your disparate kinds of work.

> My favourite approach assigns one global thread pool (queue) to each function or processing step.

Yes, creating different thread pools for different kinds of work is a much better way to use them...
This feels language-specific. Java has excellent thread support, for example; it's quite hard to build an application that deadlocks.

Thread-per-request is how everything used to be done. Most languages switched to pools because of the huge fork/threading overhead.

Other languages have embraced async programming to deal with these issues and use thread-per-core, but that introduces some fairly awkward coding constructions of its own.

I believe the best approach is transparent fiber support. Java's Project Loom is a very ambitious attempt at this, and if you're okay with a "hacky" solution, you can already use fibers today via Quasar's bytecode instrumentation.

In Java, at least, the biggest barrier to fiber support is existing libraries not supporting it (looking at you, JDBC/JPA). Hopefully Project Loom will make existing code work with little or no modification.
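(A minimal sketch of what that ends up looking like, assuming a JDK with virtual threads, which Loom eventually shipped as final in JDK 21; the sleep is just a placeholder for a blocking call.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Each task gets its own cheap virtual thread, so plain blocking code no
    // longer monopolizes a small pool of OS threads.
    public class LoomSketch {
        public static void main(String[] args) throws Exception {
            try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
                for (int i = 0; i < 10_000; i++) {
                    int id = i;
                    exec.submit(() -> {
                        Thread.sleep(100);   // blocks, but only parks the virtual thread
                        return id;
                    });
                }
            } // close() waits for all submitted tasks to finish
        }
    }

The appeal is exactly the "transparent" part: the code reads like thread-per-request, and the runtime multiplexes it onto roughly one carrier thread per core.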
The author spends a lot of time describing the problem, and not a lot on the solution.

The solution proposed seems to be some variant on dataflow programming: https://en.wikipedia.org/wiki/Dataflow_programming

Am I misunderstanding?
For some use cases, I like the approach of having a fixed number of main, long-lived threads/processes at runtime (matching the number of CPU cores on the machine to minimize context switching). Where possible, the threads/processes are themselves responsible for picking up the work that must be done. Each thread/process can use hashing/sharding to figure out which portion of the state it is responsible for and only operate on those parts. These kinds of threads should handle the vast majority of the work. Of course, for this to be possible, the work needs to be parallelizable (servers/sockets make a good use case).

If you have occasional I/O tasks which don't require much CPU (e.g. disk I/O), you can use a separate threadpool for those, since the context-switching penalty will be minimal. Context switching is mostly a problem when all the threads/processes are using a lot (and an equal amount) of CPU. If a very heavy thread/process has to share a CPU with a very lightweight thread/process, the context-switching penalty is not significant.
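(A rough Java sketch of that shape, with made-up names: one worker per core, each owning the shard of state chosen by key hash, plus a small separate pool for occasional blocking I/O.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // N long-lived workers (N = cores), each owning one shard of the state.
    // Work for a given key always routes to the same worker, so shards need no locks.
    class ShardedWorkers {
        static final int N = Runtime.getRuntime().availableProcessors();

        // One single-threaded executor per shard: the thread *is* the shard owner.
        private final ExecutorService[] shards = new ExecutorService[N];

        // Separate, small pool for occasional blocking I/O (e.g. disk writes).
        private final ExecutorService ioPool = Executors.newFixedThreadPool(2);

        ShardedWorkers() {
            for (int i = 0; i < N; i++) shards[i] = Executors.newSingleThreadExecutor();
        }

        void submit(String key, Runnable work) {
            int shard = Math.floorMod(key.hashCode(), N);  // hash -> owning worker
            shards[shard].execute(work);
        }

        void submitIo(Runnable io) {
            ioPool.execute(io);
        }
    }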
> Once the problem has been identified, the solution becomes obvious: make sure the work we push to thread pools describes the resources to acquire before running the code in a dedicated thread. ... My favourite approach assigns one global thread pool (queue) to each function or processing step.

The author does a remarkable job of basically describing actors without ever using the word "actor". I use GPars [1] with Groovy, which explicitly supports exactly what they talk about in a very elegant manner: you create a variety of actors and allocate them to different parallel groups so they cannot starve each other of resources, and most of these issues become controllable.

Perhaps it is the Lisp context that resists lumping state into the equation, since actors usually carry state?

[1] http://www.gpars.org/
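(Not GPars itself; just a bare-bones Java sketch of the same shape: an "actor" is a mailbox drained by a handler, and assigning different groups of actors to different executors keeps them from starving each other. Unlike GPars, this toy version pins one pool thread per actor for its whole lifetime.)

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.function.Consumer;

    // Bare-bones actor: a mailbox plus a handler, processing one message at a
    // time on whatever executor ("parallel group") the actor was assigned to.
    class TinyActor<M> {
        private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
        private final Consumer<M> handler;

        TinyActor(Consumer<M> handler, ExecutorService group) {
            this.handler = handler;
            group.execute(this::drain);   // the actor lives on its group's pool
        }

        void send(M message) { mailbox.add(message); }

        private void drain() {
            try {
                while (true) handler.accept(mailbox.take());  // one message at a time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();           // stop when interrupted
            }
        }
    }

Putting CPU-bound actors on one pool and I/O-bound actors on another is the "separate parallel groups" part: a flood of slow I/O messages can't starve the CPU-bound steps.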
> The moment we perform I/O… both the futures and the generic thread pool models fall short.

On Windows they usually work fine. OS kernel support is the key: the OS-provided thread pool scales really well. I'm talking about the modern one from threadpoolapiset.h, i.e. the StartThreadpoolIo / SubmitThreadpoolWork API functions.

> we might want to limit network requests going to individual hosts, independently from disk reads or writes, or from database transactions.

There are APIs for that too, like ReleaseSemaphoreWhenCallbackReturns and SetEventWhenCallbackReturns.