
Squeezing a Little More Performance Out of Bytecode Interpreters

92 points | by abhi9u | almost 2 years ago

7 comments

_a_a_a_ | almost 2 years ago

This is my area of interest and I know a little about interpretation/compiling. I'm a little dubious about this (which I shouldn't be, because the benchmarks speak for themselves). But a couple of doubts.

1. "it avoids an extra jump to the top of the loop at the end of each bytecode" – but the top of the loop is a known, hardcoded value, so the prefetcher should be able to take the unconditional jump at little or no cost.

2. "we have multiple dispatch points instead of a single one", which is supposed to assist branch prediction – I made a big "oooh!" at that, but thinking it over, the target of the GOTO is computed each time, e.g.

    push_local:
        void* target = targets[index];
        index += /* ... */
        goto *target;

So I don't see how the branch predictor has that much to work on. I suppose it cuts down on the possible number of targets so it's more likely to predict correctly, and reading again, that's exactly what they say: "we may only see a subset of bytecodes, which increases the chance that the processor can predict the jump correctly". Hmm, I'm a bit surprised that makes that much of a difference, but a quick look at the benchmarks suggests the improvements are really pretty small. I'd expect the technique to lose its speedup as the amount of code grows.

Perhaps if you wanted speed, picking out common bits of code and inlining them would cut down on the branches and possibly make the code more compact, for better cache use. (That's a bit petty though; it's clearly not the target of their research.)

I suppose if you had some spare bytecodes left over and you have a commonly used instruction, perhaps 'add', you might have a few different versions of add (add1, add2, add3) which behaved identically but were scattered through the 'source' code, to help the branch predictor even more.
runehol | almost 2 years ago

The key problem with this approach is not mentioned in the blog post, but is shown in figures 6 and 7 of the paper: https://stefan-marr.de/downloads/acmsac23-huang-et-al-optimizing-the-order-of-bytecode-handlers-in-interpreters-using-a-genetic-algorithm.pdf

Basically, the bytecode handler ordering does not generalise well across benchmarks or processors, so to get the speedup they see, you'd need a specialised interpreter for your specific benchmark and processor. That puts this into the "interesting, but not very practical" category.
dddnzzz334 | almost 2 years ago

Didn't know goto labels can be treated as values.

TIL: https://stackoverflow.com/q/1777990
eldenring | almost 2 years ago

It would be cool to see this compared with a simpler bin-packing technique, or even just putting the most commonly used instructions at the top, ordered by size.
astrobe_ | almost 2 years ago

> Our results suggest that it can give large improvements when training for a specific benchmark.

> improved the execution speed of the program for which the order was optimized between 0.8% and 23.0% with 7.7% on average.

Single-digit speedup is not worth it IMO, and the wide range (<1% – 23%) probably means that you specialize for some tasks at the expense of other types of tasks, which is not for everyone's interpreter.

One use I can see for it is to compile multiple inner loops and have the interpreter switch between them (probably manually), if that can get us the yummy 20% speedup. Core primitives and dispatchers are usually not the biggest part of interpreters, after all. Maybe in this scenario the core primitives could also benefit from small semantic changes, like no bounds checking for instance.
jonstewart | almost 2 years ago

I wish there were a bit more analysis of the properties/qualities of the reorderings that improved performance. Also, how do the improvements intersect with profile-guided optimization?
dvh | almost 2 years ago

The eJSVM link is broken: http://spa.info.kochi-tech.ac.jp/~ugawa/ejs-comlan-preprint.pdf