This is my area of interest and I know a little about interpretation/compiling. I'm a little dubious about this (which I shouldn't be, because the benchmarks speak for themselves), but I have a couple of doubts.<p>1. "it avoids an extra jump to the top of the loop at the end of each bytecode" – but the top of the loop is a known, hardcoded address, so the instruction prefetcher should be able to take that unconditional jump at little or no cost.<p>2. "we have multiple dispatch points instead of a single one", which is supposed to assist branch prediction - I made a big "oooh!" at that, but thinking it over, the target of the goto is computed each time, e.g.<p><pre><code> push_local:
 /* ... do the actual push of the local ... */
 index += 2;                          /* step past this opcode and its operand, say */
 void* target = targets[code[index]]; /* handler address for the next bytecode */
 goto *target;                        /* GCC computed goto: a separate indirect branch in each handler */
</code></pre>
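<p>For contrast, the single-dispatch-point loop they're comparing against would look roughly like this - my sketch, not their code, with made-up opcode names and a one-byte operand for OP_PUSH_LOCAL:<p><pre><code> enum { OP_PUSH_LOCAL, OP_ADD, OP_HALT };

 void run(const unsigned char* code, const int* locals) {
     int stack[256], sp = -1, index = 0;
     for (;;) {
         switch (code[index++]) {    /* the single, shared indirect branch */
         case OP_PUSH_LOCAL:
             stack[++sp] = locals[code[index++]];
             break;                  /* the "extra jump" back to the top of the loop */
         case OP_ADD:
             stack[sp - 1] += stack[sp]; sp--;
             break;
         case OP_HALT:
             return;
         }
     }
 }
</code></pre>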
So, even with the computed gotos, I don't see how the branch predictor has that much to work on. I suppose having a dispatch point per handler cuts down the number of likely targets at each one, so it's more likely to predict correctly - and reading again, that's exactly what they say: "we may only see a subset of bytecodes, which increases the chance that the processor can predict the jump correctly". Hmm, I'm a bit surprised that makes that much of a difference, but a quick look suggests the benchmark programs are really pretty small. I'd expect the technique to lose its speedup as the amount of interpreted code grows.<p>Perhaps if you wanted speed, picking out commonly executed sequences of bytecodes and inlining each into a single handler would cut down on the branches and possibly make the code more compact, for better cache use - sketched below. (That's a bit petty though; it's clearly not the target of their research.)<p>I suppose if you had some spare bytecodes left over and a commonly used instruction, perhaps 'add', you might have a few different versions of add (add1, add2, add3) which behaved identically but were scattered through the 'source' code, to help the branch predictor even more - also sketched below.
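<p>For the duplicated-add idea, roughly (same made-up names as my other sketches; the key is that each copy ends in its own goto, otherwise there'd still be only one dispatch site - I believe the interpreter papers call this 'replication'):<p><pre><code> op_add1:
     stack[sp - 1] += stack[sp]; sp--;
     index += 1;
     goto *targets[code[index]]; /* dispatch site #1, with its own prediction history */
 op_add2:
     stack[sp - 1] += stack[sp]; sp--;
     index += 1;
     goto *targets[code[index]]; /* dispatch site #2 */
 op_add3:
     stack[sp - 1] += stack[sp]; sp--;
     index += 1;
     goto *targets[code[index]]; /* dispatch site #3 */
</code></pre>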
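<p>And for the inlining idea: if, say, push-local-followed-by-add turned out to be a hot pair, it could get one fused handler (again my sketch with the same made-up names - a 'superinstruction', I think the term is):<p><pre><code> op_push_local_add:
     /* PUSH_LOCAL then ADD folded into one body: pushing a local and
        immediately adding collapses to an in-place add on the stack top */
     stack[sp] += locals[code[index + 1]];
     index += 2;                 /* skip the fused opcode and its operand */
     goto *targets[code[index]]; /* one dispatch where there used to be two */
</code></pre>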