> With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it push to the limit of hand-written assembly interpreters? Unfortunately, the answer is still no, at least at its current state.<p>> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide to the calling convention, specifically, every callee-saved register must retain its old value at function exit.<p>This is correct, wasting of callee-saved registers is a shortcoming of the approach I published about protobuf parsing (linked from the first paragraph above). More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".<p>I need to spend some time reading this article in more detail, to more fully understand this new work. I would like to know if a new calling convention in Clang would have the same performance benefits, or if Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and therefore somewhat orthogonal.