Building the fastest Lua interpreter automatically

571 points by hr0m over 2 years ago

32 comments

haberman over 2 years ago
> With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it push to the limit of hand-written assembly interpreters? Unfortunately, the answer is still no, at least at its current state.

> The main blockade to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide by the calling convention; specifically, every callee-saved register must retain its old value at function exit.

This is correct: wasting callee-saved registers is a shortcoming of the approach I published about protobuf parsing (linked from the first paragraph above). More recently I have been experimenting with a new calling convention that uses no callee-saved registers to work around this, but the results so far are inconclusive. The new calling convention would use all registers for arguments, but allocate registers in the opposite order of normal functions, to reduce the chance of overlap. I have been calling this calling convention "reverse_cc".

I need to spend some time reading this article in more detail to more fully understand this new work. I would like to know whether a new calling convention in Clang would have the same performance benefits, or whether Deegen is able to perform optimizations that go beyond this. Inline caching seems like a higher-level technique that operates above the level of individual opcode dispatch, and is therefore somewhat orthogonal.
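For readers who haven't seen the protobuf work: the tail-call dispatch pattern under discussion looks roughly like the sketch below. This is a minimal illustration of my own, not code from the article or from protobuf; it assumes Clang's [[clang::musttail]], and the two opcodes are invented.

    #include <cstdint>
    #include <cstdio>

    // Every handler takes the same argument list, so the compiler can keep
    // these values pinned in argument registers across tail calls.
    struct Interp;
    using Handler = void (*)(Interp &in, const uint8_t *pc, int64_t acc);

    struct Interp {
        const Handler *dispatch;  // one handler per opcode
        int64_t result;
    };

    static void op_add1(Interp &in, const uint8_t *pc, int64_t acc) {
        acc += 1;
        ++pc;
        // Tail call into the next opcode's handler: no new stack frame, but
        // the handler must still preserve any callee-saved registers it uses.
        [[clang::musttail]] return in.dispatch[*pc](in, pc, acc);
    }

    static void op_halt(Interp &in, const uint8_t *, int64_t acc) {
        in.result = acc;
    }

    int main() {
        static const Handler table[] = {op_add1, op_halt};
        const uint8_t code[] = {0, 0, 0, 1};  // add1, add1, add1, halt
        Interp in{table, 0};
        in.dispatch[code[0]](in, code, 0);
        std::printf("%lld\n", (long long)in.result);  // prints 3
    }

The callee-saved-register problem described above is exactly what this sketch cannot escape: each handler is still an ordinary function under the platform ABI.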

saagarjha over 2 years ago
I've been working on an early design of a high-performance dynamic binary translator that cannot JIT, and have reached a very similar conclusion to the author's. We have an existing threaded interpreter, but it's a mess of hard-to-maintain assembly for two architectures, and we run into funny issues all the time where the two diverge. Plus, being handwritten by people who are not scheduling experts, there is probably some performance left on the table because of our poor choices and the design making it difficult to write complex-but-more-performant code. Nobody wants to write an efficient hash for TLB lookups in a software MMU using GAS macros.

The core point I've identified is that existing compilers are pretty good at converting high-level descriptions of operations into architecture-specific code (at least, better than we are, given the number of instructions we have to implement) but absolutely awful at doing register selection or dealing with the open control flow that is important for an interpreter. Writing everything in assembly lets you do those two things, but you miss out on all the nice processor stuff that LLVM has encoded into TableGen.

Anyway, the current plan is that we're going to generate LLVM IR for each case and run it through a custom calling convention to take that load off the compiler, similar to what the author did here. There's a lot more that I'm handwaving over that's still going to be work, like whether we can automate the process of translating the semantics for each instruction into code, how we plan to pin registers, and how we plan to perform further optimizations on top of what the compiler spits out, but I think this is going to be the new way that people write interpreters. Nobody needs another bespoke macro assembler for every interpreter :)

qikInNdOutReply over 2 years ago
I work on a game that mostly uses Lua as logic code, sitting on top of a C/C++ engine. One of the engine developers implemented LuaJIT years ago and found that for the actual performance, the interpreter/JIT part was negligible; the most expensive thing was the actual switch between Lua code and C code, which with a large API is a constant, ugly background performance loss.

The Lua code by itself did not run long enough to actually profit much from optimizations.

So, back then we discussed possible solutions, and one idea was basically to notice an upcoming C call in the bytecode ahead of execution and detect the stability of the arguments ahead of time. A background thread extracts the values, performs the call's argument processing, pushes the return values onto the stack, and finally sets a "valid" bit to unblock the C call (which by then is actually no longer a call). Both sides never suffer a complete cache eviction and live happily ever after. Unfortunately I have a game dev addiction, so nothing ever came of it.

But similar minds might have pushed similar ideas... so asking the hive for this jive. Anyone ever seen something similar in the wild?
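For context, the Lua↔C switch in question is the standard lua_CFunction protocol, where every crossing marshals arguments and results through the Lua stack. A minimal sketch with the stock C API follows; the binding itself (engine_dist2) is invented for illustration.

    #include <lua.hpp>  // Lua's C API headers (lua.h/lauxlib.h behind extern "C")

    // A typical engine binding. Every Lua -> C crossing pays for this stack
    // marshalling, on top of the cache effects described above.
    static int engine_dist2(lua_State *L) {
        double x = luaL_checknumber(L, 1);  // read arguments from the Lua stack
        double y = luaL_checknumber(L, 2);
        lua_pushnumber(L, x * x + y * y);   // push the result back
        return 1;                           // number of return values
    }

    void register_engine_api(lua_State *L) {
        lua_register(L, "dist2", engine_dist2);  // exposes dist2(x, y) to Lua
    }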

kzrdude over 2 years ago
Super interesting.

I think using the name "LuaJIT Remake" steps on the toes of the existing LuaJIT, and I would advise using a more original name.

Anything distinctive would be better. Lua DeeJIT comes to mind, since the generator is called Deegen.

gatane over 2 years ago
> More importantly, it is the world's fastest Lua interpreter to date, outperforming LuaJIT's interpreter by 28% and the official Lua interpreter by 171% on average on a variety of benchmarks

Wait, what the hell.

mraleph over 2 years ago
It would be interesting to see a 1-to-1 comparison with the LuaJIT interpreter _without corresponding runtime changes_. That would give a meaningful baseline for how much you potentially lose/gain using this approach.

The current measurement compares two wildly different implementation strategies: LuaJIT Remake uses ICs and hidden classes, while LuaJIT proper does not. It's a well-known fact that ICs + hidden classes can lead to major performance improvements. That insight originated in the Smalltalk world and was brought to the dynamic-language mainstream by V8†, but unfortunately it muddles the comparison here.

† V8 was open-sourced on September 2, 2008... Coincidentally (though probably not :)) JSC landed their first IC prototype on September 1, 2008.
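For readers unfamiliar with the technique mentioned above: an inline cache memoizes, at each property-access site, the hidden class it last saw and the slot it resolved, so the common monomorphic case skips the full lookup entirely. A minimal C++ sketch of my own (not code from either interpreter):

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct HiddenClass {
        // Property name -> slot index; shared by all objects with this shape.
        std::unordered_map<std::string, uint32_t> slots;
    };

    struct Object {
        HiddenClass *klass;
        int64_t fields[8];
    };

    // One of these lives at each property-access site in the bytecode.
    struct InlineCache {
        HiddenClass *cached_klass = nullptr;
        uint32_t cached_slot = 0;
    };

    int64_t get_prop(Object &o, const std::string &name, InlineCache &ic) {
        if (o.klass == ic.cached_klass)           // fast path: shape matches,
            return o.fields[ic.cached_slot];      // load straight from the slot
        uint32_t slot = o.klass->slots.at(name);  // slow path: full lookup...
        ic.cached_klass = o.klass;                // ...then memoize for next time
        ic.cached_slot = slot;
        return o.fields[slot];
    }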

abecedarius over 2 years ago
There's some prior art on DSLs for efficient interpreters, like Anton Ertl's vmgen: https://www.complang.tuwien.ac.at/anton/vmgen/

tromp over 2 years ago
Nice to see Haskell make an appearance in an article about Lua and C/C++:

> For example, at LLVM IR level, it is trivial to make a function use GHC calling convention (a convention with no callee-saved registers)

This refers to the following section of [1]:

> "cc 10" - GHC convention: This calling convention has been implemented specifically for use by the Glasgow Haskell Compiler (GHC). It passes everything in registers, going to extremes to achieve this by disabling callee save registers. This calling convention should not be used lightly but only for specific situations such as an alternative to the register pinning performance technique often used when implementing functional programming languages. At the moment only X86 supports this convention and it has the following limitations:

> On X86-32 only supports up to 4 bit type parameters. No floating-point types are supported. On X86-64 only supports up to 10 bit type parameters and 6 floating-point parameters. This calling convention supports tail call optimization but requires both the caller and callee are using it.

[1] https://llvm.org/docs/LangRef.html#calling-conventions

gary_0 over 2 years ago
I think there's a typo where the Add() example code goes `lhs.As<tDouble>() && rhs.As<tDouble>()`. I'm assuming that should be a `+`?

fsfod over 2 years ago
The author also worked on the Copy-and-Patch Compilation paper [1]; I wonder if they are going to use the same idea for one of the tiers of the JIT. I don't know if it would beat LuaJIT's JIT for code compilation speed, but the machine code could be faster.

[1] https://arxiv.org/abs/2011.13127

jonathanyc over 2 years ago
Awesome that the result is backed by benchmarks. The writing style is very clear and having actual code snippets is great. Favorited!

My tl;dr understanding is that:

- writing a bytecode interpreter in C++ is slow, and a big reason is callee-saved registers / the compiler doesn't optimize well when multiple operations are implemented in the same "giant switch table" function

- but it's annoying to write everything in assembly

- so the author made a compiler that glues together C++ implementations of bytecode instructions (compiled to LLVM IR) into an interpreter, avoiding callee-saved registers (and performs other optimizations)

It'd be interesting to see ablation testing for the other optimizations, like fusing indices into opcodes. My bet would be that avoiding having to restore registers dominated the performance impact; that was the case for the compiler I implemented with friends in college.

fasteo over 2 years ago
It is my understanding that the benchmarks are run against the LuaJIT interpreter (-j off), so here are some JIT numbers for reference, using the array3d.lua benchmark [1]:

    time luajit -j off array3d.lua
    real  0m1,968s
    user  0m1,915s
    sys   0m0,052s

    time luajit -j on array3d.lua
    real  0m0,128s
    user  0m0,068s
    sys   0m0,060s

[1] https://github.com/luajit-remake/luajit-remake/blob/master/luabench/array3d.lua

ufo over 2 years ago
One caveat to pay attention to... ever since LuaJIT and PUC-Lua diverged, PUC-Lua has made some significant changes to the garbage collector. I wonder if that might affect the comparison vs LuaJIT for memory-intensive benchmarks such as binarytrees and tablesort.

fwsgonzo over 2 years ago
I have been working on this too, and while I am not at all interested in embedding LLVM, the lack of custom calling conventions and the inability to make a direct jump are unfortunate.

My biggest pet peeve, though, is the inability to push unlikely stuff completely to the back of the program. I mean just put it away where nobody can see it. I also want to align code blocks to 1 byte. Not 16 bytes. Not a single compiler lets me realign the instruction handlers in a way that compresses them down.

I also want to know what BOLT does that improves my interpreter by around 30%.

chubot over 2 years ago
This seems like an awesome way of writing faster interpreters – i.e. not in assembly, but in C++ snippets you stitch together with a tool.

I did peek at the deegen tool a bit, and it seems quite large? https://github.com/luajit-remake/luajit-remake/tree/master/deegen

I would be interested in an overview of all the analysis it has to do, which as I understand it is basically "automated Mike Pall".

FWIW I think this is the hand-written equivalent with LuaJIT's dynasm tool: https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x64.dasc (just under 5000 lines)

Also there are several of these files with no apparent sharing, as you would get with deegen:

https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_x86.dasc

https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/vm_ppc.dasc

sigzero over 2 years ago
I wonder why the target was Lua 5.1, since 5.4 has been out for a while now.

tomcam over 2 years ago
What’s a stack spill?

krick over 2 years ago
I wonder what makes it so much worse on those two specific tasks, especially the k-nucleotide one.

flyingsky over 2 years ago
> More importantly, it is the world's fastest Lua interpreter to date, outperforming LuaJIT's interpreter by 28% and the official Lua interpreter by 171% on average on a variety of benchmarks

Really huge, if true!

JZL003 over 2 years ago
wow this is a great article, so readable

cornstalks over 2 years ago
Could this be used to build a better web assembly interpreter? wasm3 is the fastest Wasm interpreter I’ve found but this approach sounds very intriguing!

ufo over 2 years ago
Fascinating! I wonder if there's a way to keep this 100% inside the C ecosystem, without having to reach for an LLVM dependency...

MuffinFlavored over 2 years ago
> The problem with the "big switch-case" approach is that C/C++ compilers simply cannot handle such code well.

Is this the case with Rust?

farzher over 2 years ago
Faster than LuaJIT's interpreter... with the JIT disabled. Would've been nice to make that clearer.

Vt71fcAqt7 over 2 years ago
Can this be done for JavaScript?

bessiesboobies over 2 years ago
Do these changes work on Apple M1s too?

arc-in-space over 2 years ago
sillycross, please consider adding an RSS feed to your blog. People do still use those.

bobm_kite9 over 2 years ago
This sounds a lot like the approach taken by GraalVM. Can someone better versed in this area comment?

dingdingdang over 2 years ago
Very very impressive.

presheaf over 2 years ago
This is very cool.

1letterunixname over 2 years ago
PGO on a conventional executable with accumulated profiling data is likely to be more performant. JITs also have a problem because they must transmute RW data pages into RO code pages to satisfy NX. PGO enables longer-term analysis that effectively supersedes JIT. JITs are nice in theory, but don't tend to save profiling data to drive optimizations in the performance-critical sections that matter.

Might try LuaToCee and then apply PGO iteratively.

luablues over 2 years ago
> Lua is concise yet supports almost every language feature one can find in dynamic languages

Having yet another terrible package manager? A non-existent ecosystem outside of *checks notes* game scripting and nginx? Reinvent-everything, where every developer everywhere has to reinvent everything poorly because the language is "concise" and "embeddable" and stuck in the 90s? Breaking changes between versions and interpreters, worse than the Python 2/3 fiasco?

Yessir, I do love me those "features".