This is all about achieving a dataflow architecture by exploiting the CPU cache hierarchy, right?

We're basically talking about going *from* a design where 10k little "tasklet" state machines are each scheduled onto "scheduler" threads, where during its turn of the loop each tasklet might run whatever possibly-"cold" logic it likes...

...and turning it *into* a collection of cores that each have a thread pinned for evaluating one particular state-machine state, where every "tasklet" that gets scheduled onto a given core will always be doing the same logic, so that logic can stay "hot" in that core's cache. In other words, each core can function closer to a SIMD model.

Effectively, this is the same thing done in game programming when you lift everything that's about to do a certain calculation (i.e. is in a certain FSM state) up into a VRAM matrix and run a GPGPU kernel over it†. That kernel is your separate "core", processing everything in the same state.

Either way, it adds up to *dataflow architecture*: eliminating the overhead of context-switching and scheduling on general-purpose CPUs by having a specific component (like the "I/O coprocessors" on mainframes, or the (de)muxers in backbone switches) for each step of a pipeline, where that component can "stay hot" by doing exactly and only what it does, synchronously.

The difference here is that, instead of throwing your own microcontrollers or ASICs at the problem, you're getting 80% of the same benefit from using a regular CPU core but making it avoid executing any non-local jumps: which is to say, not just eliminating OS-level scheduling, but eliminating any sort of top-level per-event-loop switch that might jump to an arbitrary point in your program.

This is way more of a win for CPU programming than you'd expect from just subtracting the nominal time an OS context switch takes. Rewriting your logic to run as a collection of these "CPU kernels" (effectively, restricting your code the same way GPGPU kernels are restricted, and then just throwing it onto a CPU core) keeps the kernel's cache lines from being evicted, and builds up (and never throws away) an excellent stream of branch-prediction metadata for the CPU to use.

The *interesting* thing, to me, is that a compiler, or an interpreter JIT, could (theoretically) do this "kernel hoisting" for you. As long as there's a facility in your language to make it clear to the compiler that a particular function *is* an FSM state transition-function, you can write regular event-loop/actor-modelled code, and the compiler can transform it into a collection of pinned-core kernels like this as an *optimization*.
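As a rough sketch of what one of those hand-written pinned-core kernels might look like (all the names here, `Tasklet`, `StateQueue`, `pin_to_core`, and the toy transition functions, are invented for illustration, and the affinity call is Linux/glibc-specific): each FSM state gets its own queue and its own worker thread pinned to a core, and a worker only ever runs its state's transition function before forwarding the tasklet to the next state's queue.

```cpp
// Sketch: one pinned worker per FSM state, so each core keeps exactly one
// transition function (plus its branch-predictor history) hot.
// Linux-specific affinity; all names and transitions are toy examples.
#include <pthread.h>
#include <sched.h>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

enum class State { Parse, Compute, Done };
struct Tasklet { int id; int payload; State state; };

// Per-state mailbox. A real implementation would use a lock-free ring.
struct StateQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Tasklet> q;
    void push(Tasklet t) {
        { std::lock_guard<std::mutex> l(m); q.push_back(t); }
        cv.notify_one();
    }
    Tasklet pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        Tasklet t = q.front(); q.pop_front();
        return t;
    }
};

StateQueue parse_q, compute_q, done_q;

// Pin the calling thread to one core so its working set stays resident there.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Each worker is one "CPU kernel": it only ever executes its own state's
// transition, then forwards the tasklet to the next state's queue.
void parse_worker() {
    pin_to_core(0);
    for (;;) {
        Tasklet t = parse_q.pop();
        if (t.id >= 0) { t.payload += 1; t.state = State::Compute; }  // toy Parse step
        compute_q.push(t);
        if (t.id < 0) return;  // shutdown sentinel, forwarded downstream
    }
}

void compute_worker() {
    pin_to_core(1);
    for (;;) {
        Tasklet t = compute_q.pop();
        if (t.id >= 0) { t.payload *= 2; t.state = State::Done; }     // toy Compute step
        done_q.push(t);
        if (t.id < 0) return;
    }
}

int main() {
    std::thread a(parse_worker), b(compute_worker);
    for (int i = 0; i < 4; ++i) parse_q.push({i, i, State::Parse});
    parse_q.push({-1, 0, State::Parse});  // sentinel: drain the pipeline and stop
    for (;;) {
        Tasklet t = done_q.pop();
        if (t.id < 0) break;
        std::printf("tasklet %d finished with payload %d\n", t.id, t.payload);
    }
    a.join(); b.join();
    return 0;
}
```

The queues are the "passing tasklet state around for free" part; the point of the shape is that each core's instruction working set never changes, not the particular queue implementation.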
The compiler can even take a hybrid approach, where some cores do classical scheduling for all the "miscellaneous, IO-heavy" tasklets, and the rest of the cores are special schedulers that will only be passed a tasklet when it's ready to run in that scheduler's preferred state. With an advanced scheduler system (e.g. the Erlang VM's), the profiling JIT could even notice when your runtime workload has shifted to having 10k instances of the same transition-function running all the time, generate and start up a CPU-kernel thread for it (temporarily displacing one of the misc-work classical scheduler threads), and deschedule it again if the workload shifts so that it's no longer a win.

Personally, I've been considering this approach with GPGPU kernels as the "special" schedulers rather than CPU cores, but they're effectively equivalent in architecture, and perhaps in performance as well. While the GPU is faster because it gets to run your specified kernel with true SIMD parallelism, your (non-NUMA) CPU cores get to pass your tasklets' state around "for free", which often balances out: ramming data into and out of VRAM is expensive, and the fact that you're buffering tasklets to run on the GPGPU as a group potentially introduces a high-latency sync point for your tasklets. Soft-realtime guarantees might be more important than throughput.

---

† A fun tangent for a third model: if your GPGPU kernel outputs a separate dataset for each new FSM state its data members were found to transition to, and you have *other* GPGPU kernels for each of the other state-transition functions of your FSM waiting to take those datasets and run with them, then effectively your whole FSM can live entirely on the GPU as a collection of kernel "cores" passing tasklets back and forth, the same way we've been talking about with CPU cores above (there's a rough sketch of this at the end of the comment).

While this architecture probably wins over both the entirely-CPU and CPU-passing-to-GPGPU models for pure-computational workloads (which is, after all, what GPGPUs are supposed to be for), I imagine it would fall over pretty fast if you wanted to do much IO.

Does anyone know whether GPUs marketed specifically as GPGPUs, like the Tesla cards, have a means for low-latency access to regular virtual memory from within a GPGPU kernel? If they did, staying entirely on the GPGPU would definitely be the dominant strategy. At that point, it might even make sense to have, for example, a GPGPU Erlang VM, or entirely-in-GPGPU console emulators (imagine MAME's approach to emulated chips, but with each chip as a GPGPU kernel).

If you can get that, then effectively what you've got is less a GPU and more a meta-FPGA co-processor with "elastic allocation" of gate-arrays to your VHDL files. System architecture would likely change *a lot* if we ever got *that* in regular desktop PCs.
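To make that third model concrete, here's a rough CUDA sketch of the shape, with invented names and toy transition rules (error handling omitted, nothing here reflects a real system): one kernel per FSM state, each consuming a buffer of tasklets in its state and scattering results into the buffer of whichever state they transition to, while the host does nothing but route buffers between launches.

```cpp
// Sketch of "the FSM lives entirely on the GPU": one kernel per transition
// function, compacting outputs into per-next-state buffers via atomicAdd.
#include <cstdio>
#include <cuda_runtime.h>

struct Tasklet { int id; int value; };

// Toy state A transition: halve the value; zero means done, otherwise go to B.
__global__ void state_a_kernel(const Tasklet* in, int n,
                               Tasklet* to_b, int* b_count,
                               Tasklet* done, int* done_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Tasklet t = in[i];
    t.value /= 2;
    if (t.value == 0) done[atomicAdd(done_count, 1)] = t;
    else              to_b[atomicAdd(b_count, 1)]   = t;
}

// Toy state B transition: decrement, then hand everything back to A.
__global__ void state_b_kernel(const Tasklet* in, int n,
                               Tasklet* to_a, int* a_count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Tasklet t = in[i];
    t.value -= 1;
    to_a[atomicAdd(a_count, 1)] = t;
}

int main() {
    const int N = 1024;
    Tasklet *a_buf, *b_buf, *done_buf;
    int *a_n, *b_n, *done_n;
    // Managed memory keeps the sketch short; a real version would stage
    // explicit device buffers and avoid the per-launch host round trips.
    cudaMallocManaged(&a_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&b_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&done_buf, N * sizeof(Tasklet));
    cudaMallocManaged(&a_n, sizeof(int));
    cudaMallocManaged(&b_n, sizeof(int));
    cudaMallocManaged(&done_n, sizeof(int));

    for (int i = 0; i < N; ++i) a_buf[i] = {i, 100 + i};
    *a_n = N; *b_n = 0; *done_n = 0;

    // The host only routes buffers between kernels; all transition logic
    // and all tasklet state stay resident on the GPU.
    while (*a_n > 0 || *b_n > 0) {
        if (*a_n > 0) {
            int n = *a_n; *a_n = 0;
            state_a_kernel<<<(n + 255) / 256, 256>>>(a_buf, n, b_buf, b_n,
                                                     done_buf, done_n);
            cudaDeviceSynchronize();
        }
        if (*b_n > 0) {
            int n = *b_n; *b_n = 0;
            state_b_kernel<<<(n + 255) / 256, 256>>>(b_buf, n, a_buf, a_n);
            cudaDeviceSynchronize();
        }
    }
    std::printf("%d tasklets reached Done\n", *done_n);
    return 0;
}
```

The per-launch cudaDeviceSynchronize is exactly the high-latency sync point mentioned above; the win is that no tasklet state ever has to leave VRAM between transitions.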