The "core" of the trick is nice: amortizing interpreter dispatch over many items.
(ignoring the column layout/SIMD stuff, which helps in basically any case)<p>Essentially it's turning:<p><pre><code> LOAD
DISPATCH
OP1
DISPATCH
OP2
... (once per operation in the expression)
STORE
... (once per row)
</code></pre>
into:<p><pre><code> DISPATCH
LOAD
OP1
STORE
LOAD
OP1
STORE
... (once per row)
DISPATCH
... (once per operation in the expression)
</code></pre>
The nice trade-off here is that you don't need code generation to get this, but it's still not optimal.<p>If you can generate code, it's better still to fuse the operations, giving something like:<p><pre><code> LOAD
OP1
OP2
...
STORE
LOAD
...
</code></pre>
Fusion helps because even though you can tune your batch size so that loads and stores mostly hit cache, they still aren't free.<p>For example, Haswell can only issue one store per cycle, so if each OP is a single add, the batched version is store-bound and you're leaving up to 3/4 of your theoretical ALU throughput on the table.