Love to see negative results published; it's so important.

Please let's all move toward a research procedure that requires the hypothesis to be registered before any research begins, and that mandates publication regardless of the results.
I think the missing piece here is that JavaScriptCore (JSC) and other such systems don't just use inline caching to speed up dynamic accesses; they use it as profiling feedback.

So, any time you have an IC in interpreter, baseline, or lightly optimized code, that IC is monitored to see how polymorphic it gets, and that data is fed back into the optimization pipeline.

Having an IC as a dead end, where you don't use it for profiling, is way less profitable than having an IC that feeds into profiling.
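To make that concrete, here is a minimal sketch in C of an inline cache that doubles as a profiling site (all names are invented for illustration; this is not JSC's actual data structure): besides caching the last shape, the site records every shape it has seen, so the optimizing tier can later ask "how polymorphic is this access?" instead of just "what was cached last?".

    #include <stddef.h>

    #define IC_MAX_SHAPES 4  /* beyond this, call the site megamorphic */

    typedef struct Shape Shape;  /* hypothetical hidden-class type */
    typedef struct { Shape *shape; void **slots; } Object;

    typedef struct {
        Shape *cached_shape;     /* fast-path guard, as in a plain IC */
        size_t cached_offset;
        /* profiling side: every shape ever seen, not just the last */
        Shape *seen[IC_MAX_SHAPES];
        int    seen_count;       /* > IC_MAX_SHAPES means megamorphic */
    } InlineCache;

    static void ic_record(InlineCache *ic, Shape *s) {
        for (int i = 0; i < ic->seen_count && i < IC_MAX_SHAPES; i++)
            if (ic->seen[i] == s) return;
        if (ic->seen_count < IC_MAX_SHAPES)
            ic->seen[ic->seen_count] = s;
        ic->seen_count++;        /* saturates into "megamorphic" */
    }

    void *ic_load(InlineCache *ic, Object *obj,
                  size_t (*slow_lookup)(Shape *)) {
        if (obj->shape == ic->cached_shape)   /* fast path: cache hit */
            return obj->slots[ic->cached_offset];
        /* slow path: real lookup, repair the cache, record feedback */
        size_t off = slow_lookup(obj->shape);
        ic->cached_shape  = obj->shape;
        ic->cached_offset = off;
        ic_record(ic, obj->shape);
        return obj->slots[off];
    }

The optimizing tier then reads seen_count: 1 means monomorphic (specialize aggressively), a handful means emit a polymorphic dispatch chain, more means stay generic. That feedback loop, not the cache itself, is the point.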
Slightly orthogonal...

My Sciter uses QuickJS (no JIT); instead of a JIT, I've added a C compiler. That means we can add not just JS modules but C modules too:

    import * as cmod from "./cmodule.c"
Such a cmodule is compiled to native code and executed on the fly. The idea is simple: each language is good for specific tasks. JS is flexible and C is performant; just use the tool that is optimal for the task.

c-modules play two major roles: FFI and number-crunching code (a sketch of the latter follows below).

Sciter uses the TCC compiler and runtime.

The total size of the QuickJS + TCC binary bundle is 500k + 220k = 720k. For comparison, V8 is about 40MB.

https://sciter.com/c-modules-in-sciter/
https://sciter.com/here-we-go/
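To give a flavor of the split this enables, here's a purely illustrative cmodule.c (the function name is made up, and the links above, not this sketch, document the actual export convention Sciter's c-modules use): the hot numeric loop lives in C, compiled by TCC at load time, while JS keeps the control flow.

    /* cmodule.c -- hypothetical example of a number-crunching module */

    /* Sum of squared differences between two arrays: the kind of tight
       numeric loop that is slow in an interpreter but trivial in C. */
    double sum_sq_diff(const double *a, const double *b, int n) {
        double acc = 0.0;
        for (int i = 0; i < n; i++) {
            double d = a[i] - b[i];
            acc += d * d;
        }
        return acc;
    }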
The people who came up with this are obviously brilliant, but being French myself, I really wonder why no one is proofreading the English. It gives an overall bad impression of the work, imho.
Chasing inline-cache micro-optimizations with dynamic binary modification is a dead end. Modern CPUs are laughing at our outdated compiler tricks; maybe it's time to accept that clever hacks won’t outrun silicon.
It's good that they post negative results, but it's hard to know exactly why their attempt failed, and it's tempting for me to make guesses without doing any measurements, so let me fall for that temptation.

They are patching inline-cache sites in an AOT binary and not seeing improvements.

Only 17% of the inline-cache sites could be optimized to what they call O2 level (listing 7). Most could only be optimized to O1 level (listing 6). The only difference from the baseline (listing 5) to O1 is that they replaced

    mov 0x101c(%rip), %rax  # load the offset

with

    mov $0x3, %rax          # load the offset

I'm not very surprised that this did not help much. The old load is probably hoisted up and loaded into a renamed register very early, and it won't miss in the cache.

Basically, they already have a pretty nice inline-cache system, at least for the monomorphic case, and messing with the exact instructions used to implement it doesn't help much. A JIT is able to do so much more: handle polymorphic cases, inline simple methods, and eliminate repeated checks of the same hidden class. Not to mention detecting at runtime that some unknown object is almost always an integer or a float and JITting code specialized for that.

People new to virtual machines often focus on the compiler, whereas the stuff that moves the needle is often around the runtime: how tagged and typed data is represented (see the tagging sketch below), the GC implementation, and the object layout. E.g. this paper explores an interesting new tagging technique and makes a huge difference to performance (there's some author overlap): https://www.researchgate.net/figure/The-three-representations-in-a-tagged-object-system-here-shown-on-a-little-endian_fig1_386112036

Incidentally, the assembly syntax in the "Attempt to catch up" article is a bit confusing. It makes it look like the IC addresses are very close to the code, almost on the same page. Stack Overflow explains it: GAS syntax for RIP-relative addressing looks like symbol + current_address (RIP), but it actually means symbol with respect to RIP.

There's an inconsistency with numeric literals: [rip + 10], or AT&T 10(%rip), means 10 bytes past the end of this instruction, whereas [rip + a], or AT&T a(%rip), means to calculate a rel32 displacement to reach the symbol a, not RIP + the symbol's value. (The GAS manual documents this special interpretation.)
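To illustrate what "how tagged and typed data is represented" means in practice, here's the classic low-bit tagging scheme in C. This is a generic textbook sketch, not the specific technique from the linked paper: small integers live directly in the value word behind a one-bit tag, so integer arithmetic never touches memory.

    #include <assert.h>
    #include <stdint.h>

    /* A value word is either a pointer (low bit 0, guaranteed by
       alignment) or a 63-bit small integer (low bit 1). */
    typedef uintptr_t Value;

    static inline Value    int_to_value(intptr_t i) { return ((uintptr_t)i << 1) | 1; }
    static inline intptr_t value_to_int(Value v)    { return (intptr_t)v >> 1; }
    static inline int      is_int(Value v)          { return v & 1; }

    /* Adding two small ints needs no allocation, no shape check, no
       load: (2a+1) + (2b+1) - 1 == 2(a+b) + 1. Representation choices
       like this are what actually move the needle. */
    static inline Value add_ints(Value a, Value b) {
        assert(is_int(a) && is_int(b));
        return a + b - 1;
    }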
This seems poorly grounded. In fact, almost three decades after the release of the Java HotSpot runtime, we're still waiting for even one system to deliver the promised advantage of JITs over static compilation. I guess the consensus is that V8 has come closest?

But the reality is that hand-optimized AoT builds remain the gold standard for performance work.
The paper seems to start with the bizarre assumption that AOT compilers need to "catch up" with JIT compilers, and in particular that they would benefit from inline caches for member lookup.

But AOT compilers are usually for well-designed languages that don't need those inline caches, because the designers properly specified a type system that guarantees a field is always stored at the same offset.

They might benefit from a similar mechanism to predict branches and indirect branches (i.e. virtual/dynamic dispatch), but they already have compile-time profile-guided optimization and CPU branch predictors at runtime.

Furthermore, for branches that almost always go one direction but can seldom change, there are frameworks like the Linux kernel's "alternatives" and "static key" mechanisms (sketched below).

So the opportunity for making things better with self-modifying code is limited to code where all of those mechanisms don't work well and the overhead of the runtime profiling is worth it.

Which is probably very rare, and not worth bringing in a JIT compiler for.
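For reference, this is roughly what the static-key mechanism looks like at a use site (the key and function names here are made up for the example). The conditional is not a real runtime branch: the compiler emits a NOP in the hot path, and enabling the key binary-patches that NOP into a jump at every site.

    #include <linux/jump_label.h>

    void handle_rare_feature(void);  /* hypothetical slow path */

    /* Hypothetical feature flag; defaults to off. */
    DEFINE_STATIC_KEY_FALSE(my_feature_key);

    void hot_path(void)
    {
        /* Compiles to a NOP until the key is enabled. */
        if (static_branch_unlikely(&my_feature_key))
            handle_rare_feature();
    }

    void turn_feature_on(void)
    {
        static_branch_enable(&my_feature_key);  /* patches all sites */
    }

This is exactly the narrow kind of self-modifying code that pays off: a branch that flips rarely, with zero profiling overhead in the fast path.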