Disagree with the characterization of CSP. CSP, as I understand it, has channels as the main building block for interprocess communication. Channels can be synchronous or asynchronous, and bounded or unbounded. The important point is that a channel is a value (you can pass it to a method, return it from a method) and usually has a type.<p>Actors are like a simplified CSP where each (lightweight) thread has a single input channel. In the case of Akka this means you lose type information, because control messages are mingled with data messages and you can't assign any useful type to them.<p>Disruptor is mainly a pattern and implementation for high efficiency -- big queues, a minimal number of threads, and some tricks using CAS operations and the like. I wouldn't call it a model of concurrency -- it's basically a particular implementation of CSP.
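To make the "channel as a typed value" point concrete, here is a minimal Go sketch (the names are illustrative, not from any of the libraries discussed):

    package main

    import "fmt"

    // producer sends values on a typed channel that was handed to it as an
    // ordinary argument, then closes it.
    func producer(out chan<- int) {
        for i := 0; i < 3; i++ {
            out <- i
        }
        close(out)
    }

    // makeSource returns a channel -- another way of treating it as a value.
    func makeSource() <-chan int {
        ch := make(chan int, 3) // bounded; asynchronous up to its capacity
        go producer(ch)
        return ch
    }

    func main() {
        for v := range makeSource() {
            fmt.Println(v) // 0, 1, 2
        }
    }

The channel carries both its element type and its direction (send-only, receive-only) in the type system, which is exactly the information an untyped actor mailbox gives up.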
Can anyone elaborate on what happens with Akka after 5 cores? In all of the timings, Akka shows the same exponential drop as all of the other systems -- until it hits 5 cores. At that point it levels off, or even goes up. Is there anything inherent in the Akka implementation that would cause this?
There should be a rethink of how our multicore computers are architected. Here is what I have to do in order to write a multicore parallel soft real-time application.<p>1) I have to scrutinize the way I use memory, such that no two threads stomp on the same cache line too often. (False sharing; see the sketch below.)<p>2) The above includes the way I use locking primitives and the higher-level constructs built on them! (Lock convoying)<p>3) If I am using a sophisticated container like a PersistentMap, which is supposed to make it easy to think about concurrency, I still have to think about 1 & 2 above at the level of the container's API/contract, as well as about how such containers might interleave contended data within their implementations. (Yes, Concurrency is not Parallelism. Here we see why.)<p>4) Garbage collection -- now I have to consider whether the GC runs in a separate worker thread, and how that can result in cache-line contention.<p>5) Even if you do all of the above correctly, the OS can still come along and do something counterproductive, like trying to schedule all your threads on as few cores/sockets as possible. (This is even nicknamed "Stupid Scheduling" in Linux by people who have to contend with it.) This entails yet more work.<p>6) Profiling all of the above is, as far as I can tell, still a hazy art. No single measurement is a smoking gun for any one of the pitfalls of multicore parallelism, so one is left with educated guesses, which means gathering more data. Is there a tool like QEMU that can simulate multicore machines and provide statistics on cache misses? Apparently there are ways to get this information from hardware performance counters as well.<p>It would seem that Erlang has an advantage with regard to multicore parallelism, because its model is "distributed by default," so contention is severely limited, which is great for parallelism. However, coordination is severely limited as well! (I need to look at Supervisors and see what they can and cannot do.)<p>It would also seem there is room for languages that combine these recently popular advanced concurrency tools with enough low-level power to navigate the above pitfalls, plus a memory-management abstraction that increases productivity without the complications for parallelism entailed by GC. Rust, C++, and Objective-C are the only languages that somewhat fit this bill. (If only Rust were not quite so new!) Go, with its emphasis on pass-by-value semantics, might also work for certain applications, despite its reliance on GC.
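For what it's worth, here is a rough Go sketch of the false-sharing point in (1). The 64-byte line size and the padding trick are assumptions about typical hardware, and the timings will vary by machine:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Two counters updated by two goroutines. In sharedCounters they sit on
    // the same cache line, so each write invalidates the other core's copy;
    // in paddedCounters the padding pushes them onto separate lines.
    type sharedCounters struct {
        a, b uint64
    }

    type paddedCounters struct {
        a uint64
        _ [56]byte // pad to a 64-byte cache line (assumed line size)
        b uint64
    }

    func bench(name string, c1, c2 *uint64) {
        const iters = 50_000_000
        start := time.Now()
        var wg sync.WaitGroup
        wg.Add(2)
        go func() {
            defer wg.Done()
            for i := 0; i < iters; i++ {
                *c1 += 1
            }
        }()
        go func() {
            defer wg.Done()
            for i := 0; i < iters; i++ {
                *c2 += 1
            }
        }()
        wg.Wait()
        fmt.Printf("%s %v\n", name, time.Since(start))
    }

    func main() {
        var s sharedCounters
        var p paddedCounters
        bench("same cache line:", &s.a, &s.b)
        bench("padded:         ", &p.a, &p.b)
    }

Neither goroutine ever touches the other's counter, yet the unpadded version is typically much slower purely because of cache-line ping-pong -- which is why this is so hard to spot from the source code alone.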
Disruptor is an <i>application</i> of the pre-emptive threading model. It is certainly interesting, but to put it on the same pedestal as CSP and the Actor model is wrong. (Also, the stability of the Disruptor approach in the face of competing applications on the same machine was an issue, last I checked.)
It is interesting that, since Erlang data structures are functional (immutable), actor mailboxes could just as well be implemented to share data instead of copying it. Large binaries are handled that way: they live in a shared binary memory area and are referenced via pointers. The rest of the message data is not. At some point it was deemed better to actually make the copy.
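Not Erlang, but a rough Go sketch of the same tradeoff (copy the small message envelope, share the big immutable payload by reference); the Message type and the channel-as-mailbox framing are made up for illustration:

    package main

    import "fmt"

    // Message models a mailbox entry: small fields travel by copy, while the
    // large immutable payload is shared via the slice's backing array --
    // roughly analogous to large binaries living in a shared area and being
    // passed by reference.
    type Message struct {
        Kind    string
        Payload []byte // never mutated after construction
    }

    func main() {
        big := make([]byte, 1<<20) // 1 MiB payload, allocated once
        mailbox := make(chan Message, 4)

        mailbox <- Message{Kind: "frame", Payload: big}
        close(mailbox)

        for m := range mailbox {
            // Only the struct (string header + slice header) was copied into
            // the channel; the megabyte backing array itself was not.
            fmt.Println(m.Kind, len(m.Payload))
        }
    }

The catch, of course, is that this only stays safe as long as nobody mutates the shared payload -- which the immutability of Erlang terms guarantees by construction, and Go does not.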
> <i>For maximum performance one would create one large job for each Core of the CPU used.</i><p>Dmitry Vyukov has suggested otherwise in a similar scenario using Go:<p>> <i>If you split the image into say 8 equal parts, and then one of the goroutines/threads/cores accidentally took 2 times more time to complete, then whole processing is slowed down 2x. The slowdown can be due to OS scheduling, other processes/interrupts, unfortunate NUMA memory layout, different amount of processing per part (e.g. ray tracing) and other reasons. [...] size of a work item must never be dependent on input data size (in an ideal world), it must be dependent on overheads of parallelization technology. Currently a reference number is ~100us-1ms per work item. So you can split the image into blocks of fixed size (say 64x64) and then distribute them among threads/goroutines. This has advantages of both locality and good load balancing.</i><p><a href="https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYkA_YoB-MJ" rel="nofollow">https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYk...</a><p>Or to put it this way: imagine there was zero concurrency overhead. Then splitting jobs down to their minimal size would be the ideal option, since that allows the most smoothed-out division of labour: every processor does work all the time, and all of them keep working until the entire task is completed.
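A small Go sketch of the fixed-size-work-item approach from the quote; the 64x64 tile size is taken from the quote, while the image dimensions and the processTile stub are made up:

    package main

    import (
        "runtime"
        "sync"
    )

    const tile = 64 // fixed tile size, independent of input size

    // processTile is a stand-in for the real per-tile work (e.g. ray tracing).
    func processTile(img [][]float64, x0, y0 int) {
        for y := y0; y < y0+tile && y < len(img); y++ {
            for x := x0; x < x0+tile && x < len(img[y]); x++ {
                img[y][x] *= 2 // dummy work
            }
        }
    }

    func main() {
        const w, h = 1920, 1080
        img := make([][]float64, h)
        for i := range img {
            img[i] = make([]float64, w)
        }

        // Feed fixed-size tiles through a channel; one worker per core pulls
        // a new tile as soon as it finishes the previous one, so uneven tiles
        // load-balance automatically.
        tiles := make(chan [2]int)
        var wg sync.WaitGroup
        for i := 0; i < runtime.NumCPU(); i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range tiles {
                    processTile(img, t[0], t[1])
                }
            }()
        }
        for y := 0; y < h; y += tile {
            for x := 0; x < w; x += tile {
                tiles <- [2]int{x, y}
            }
        }
        close(tiles)
        wg.Wait()
    }

With one big job per core, a single slow part drags out the whole run; with many small tiles, a slow tile only delays one worker briefly while the others keep pulling work.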