
Beating the Compiler

195 points by mkeeter, 11 months ago

18 comments

gumby, 11 months ago
IMHO the best way to think of it (well, it's how *I* have long thought of it) is two lemmas:

1 - The compiler has more "fingers" than a human does. Back when I wrote programs in assembly I would use printout to keep track of what I was doing and for debugging, and so would often put a finger on the paper to mark where a jump was and then go to the destination to see if it matched up, etc. This process isn't at all scalable, so the compiler is inevitably going to do better than I ever could on anything more than something trivial.

2 - I know more about the program constraints than I could ever express (much less be bothered expressing) in a HLL. "All the values will be less than 16" or "I can use this memory location for something else after I've read from it" or who knows what. So sometimes I can chop stuff out or even rewrite the compiler output locally to speed it up.

Also I can only do this for a few architectures... and with modern x86 there are so many variants with all sorts of microoptimization affordances that I can rarely beat the compiler (this is a corollary of lemma 1).
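Lemma 2 now has a partial workaround in Rust: an invariant the type system can't express can be asserted to the optimizer directly. A minimal sketch, assuming the caller really does guarantee gumby's "all the values will be less than 16" invariant (violating it is undefined behavior):

    // Feeding an out-of-band invariant to the optimizer.
    // `std::hint::assert_unchecked` is stable Rust; the invariant itself
    // is purely the programmer's promise, exactly as in lemma 2.
    fn low_nibble_lookup(table: &[u8; 16], x: u8) -> u8 {
        // SAFETY: caller guarantees x < 16, so the bounds check can be elided.
        unsafe { std::hint::assert_unchecked((x as usize) < table.len()) };
        table[x as usize]
    }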
phire, 11 months ago
> The dispatch loop takes a single indirect branch to the opcode-specific implementation. This means that the branch will be nigh unpredictable!

Modern branch predictors can actually predict indirect branches with multiple destinations, because they hash recent branch history into the prediction. The exact same indirect branch will end up with multiple BTB entries, based on previous control flow.

I was curious where the threaded code speedup is actually coming from. It's possible much of the speedup is coming from eliminating the extra branch back to the dispatch loop. Or maybe the history tracking in the M1's branch predictor doesn't work well for this type of control flow.

So I checked out this code and modified the next macro, adding a dummy branch over a nop, to roughly isolate this factor.

On my M1, the extra unconditional branch benchmarks at 1.08 sec for fib, and 1.19 sec for Mandelbrot (all other times were the same).

Looks like the speedup is a mixture of the two. Eliminating that extra branch is responsible for about 20-30% of the speedup, and improving prediction on indirect branches is responsible for the other 70-80%.
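For readers without the article open, the shape under discussion reduced to a minimal Rust sketch (opcodes and encoding are invented): the `match` lowers to the single shared indirect branch phire describes, and every handler funnels back through it.

    fn run(code: &[u8]) {
        let mut pc = 0usize;
        let mut stack: Vec<u8> = Vec::new();
        loop {
            let op = code[pc]; // FETCH
            pc += 1;
            match op { // one shared indirect branch serves every opcode
                0x00 => return,                            // BRK: halt
                0x01 => { stack.push(code[pc]); pc += 1; } // LIT: push literal
                0x02 => {                                  // ADD
                    let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                    stack.push(a.wrapping_add(b));
                }
                _ => panic!("unknown opcode"),
            }
            // every handler branches back here, to the shared dispatch
        }
    }

Threaded code instead duplicates the fetch-and-dispatch at the tail of every handler, which is exactly the per-branch-site separation phire's experiment isolates.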
teo_zero, 11 months ago
I was intrigued by this paragraph:

> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely. This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.

Probably normalizing the opcode implementations' length to the *maximum* length is not optimal. A possible experiment would be to normalize to the *modal* length. I would expect most opcodes to be arithmetic or logic operations and to share the same implementation length. The (hopefully few) more complex ones would need a trampoline.

Is this a crazy idea?
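For concreteness, the addressing both variants rely on, as a hedged Rust sketch (constants hypothetical): with fixed-stride handlers the jump-table load disappears because the target is computable, and in the modal-length variant an oversized handler's slot would hold only a trampoline jump to an out-of-line body.

    // Table-free dispatch: the jump target is pure arithmetic.
    const HANDLER_STRIDE: usize = 64; // the "modal" handler size, hypothetical

    fn handler_addr(handlers_base: usize, opcode: u8) -> usize {
        // Handlers longer than HANDLER_STRIDE would hold just a trampoline
        // jump at this address, leading to their real body elsewhere.
        handlers_base + (opcode as usize) * HANDLER_STRIDE
    }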
interpunct, 11 months ago
> I've proven to my satisfaction that writing an interpreter in assembly is both fun and performant!

Fun maybe/maybe not, but define "performant". I might drop into assembly for 1 or 2 orders of magnitude faster for a non-toy project. Even then, what are my requirements? What is the bus factor of LuaJIT again? Try getting support for s390x, or your patch accepted.

Look at the speedup of Lua v. LuaJIT (C-language VM v. C/Lua VM with JIT code gen):

https://web.archive.org/web/20180430211146/https://luajit.org/performance_x86.html
Validark, 11 months ago
I wish you wouldn't broadcast the sentiment contained in the first paragraph. Compilers lack the ability to consistently perform many basic optimizations, to an embarrassing extent, including even the ones you would think would be the first optimizations you'd implement when writing a compiler. Open up Godbolt and tell me if you still think the compiler knows best. I try to submit at least one issue to LLVM every time I look at the assembly it gives me. I say "try to" because sometimes it's too much to deal with.

And by the way, probably the absolute worst things the compiler does are the "smart" things. I write my code knowing what the emit should look like, and the compiler sometimes thinks it has a better idea than how I wrote the code and makes the emit a lot more convoluted and slower than it would be if it just did what I said. Actually, I do know the machine decently well, thank you.

Saying "centuries of engineering" is misleading. A lot of those centuries are people arguing about theoretical optimizations of dubious value while we still haven't even finished the most basic optimizations that are obviously beneficial.

The second paragraph, making it sound all mysterious that compilers are even worse at interpreters than at normal code, really speaks to a complete lack of exposure to the subject matter. This stuff is only esoteric because you couldn't be bothered to look at godbolt.org. Which is totally fine by the way, just don't go telling me how high quality compilers are if you've never even looked.

That would be like me telling web developers that Dreamweaver produces phenomenal HTML and CSS without ever looking at what it emits.

Sorry for the rant, but it just bothers me how much praise is heaped upon compilers like they are some kind of gift from God, forged by monks, wizards, and unrivaled geniuses. A compiler is just a tool. There is no magic.
201984, 11 months ago
Very cool! I enjoy seeing people writing asm and letting us get the most out of our CPUs.

I see you already tried what I thought of, which is getting rid of the jump table and making each instruction handler the same size. Do you think that could still work if you limited each instruction handler to 64 or 32 bytes instead of 256, and then for longer handlers jumped to a larger body of code somewhere else?
jll29, 11 months ago
This was a good read, thanks.

Instead of using unsafe Rust to optimize the interpreter loop, I would prefer to write a transpiler that compiles the source UXN binaries to local CPU binaries, which would not require making code unsafe/less readable and would permit further speed enhancements by getting rid of an interpreter loop altogether.
JackYoustra, 11 months ago
Did you try PGO? This post seems like the thing it was built for.
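For anyone who wants to try it, a sketch of the standard rustc PGO workflow from the rustc book; the binary and ROM names below are placeholders:

    # 1. Build with profile instrumentation
    RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release
    # 2. Run a representative workload (e.g. the fib and Mandelbrot benchmarks)
    ./target/release/uxn-vm fib.rom
    # 3. Merge the raw profiles (llvm-profdata ships with rustup's llvm-tools)
    llvm-profdata merge -o /tmp/pgo-data/merged.profdata /tmp/pgo-data
    # 4. Rebuild, letting the profile guide code layout and branch weights
    RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" cargo build --release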
o11c, 11 months ago
> Making all of the opcode implementations the same size (padding to the size of the largest opcode implementation with .balign 256), then removing the jump table entirely.

Another idea would be to sort the opcodes by size (32-byte, 64-byte, 128-byte, 256-byte - or perhaps +1 to avoid the set-associativity problem below) rather than simply indexing.

> This was also slower, also probably because of cache friendliness: the opcode implementations go from 16.6 KiB total to 64 KiB.

This is probably cache-unfriendly due to alignment too (associativity limits mean more conflict misses for highly-aligned memory), not just size. Most documentation only talks about this problem regarding the data cache, though...
the8472, 11 months ago
    0x100002ec0: b 0x100002d1c ; jump back to the dispatch loop

I'd expect tail-call-based control flow - i.e. computing the jump destination at the end of each opcode function - to be nicer to the branch predictor, because it could track targets separately.

Currently that's hard to achieve in Rust. There's an experiment in progress to see if guaranteed tail calls can be added to the language:

https://github.com/rust-lang/rust/issues/112788
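A sketch of how that dispatch could look if the experiment lands, assuming the nightly-only `explicit_tail_calls` feature gate from that issue and hypothetical VM types; each handler's `become` gives the predictor a distinct indirect-branch site:

    #![feature(explicit_tail_calls)] // nightly experiment, rust-lang/rust#112788

    struct Vm { pc: usize, code: Vec<u8>, stack: Vec<u8> }

    impl Vm {
        fn fetch(&mut self) -> u8 { let b = self.code[self.pc]; self.pc += 1; b }
    }

    type Handler = fn(&mut Vm);

    fn op_add(vm: &mut Vm) {
        let (b, a) = (vm.stack.pop().unwrap(), vm.stack.pop().unwrap());
        vm.stack.push(a.wrapping_add(b));
        let op = vm.fetch();
        become HANDLERS[op as usize](vm); // guaranteed tail call: no stack growth
    }

    // Toy table: a real interpreter maps each of the 256 opcodes to its handler.
    static HANDLERS: [Handler; 256] = [op_add; 256];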
rerdavies, 11 months ago
It would be interesting to see what happens if you turn on profile-guided optimization. This seems like a very good application for PGO.

The 518 ms profile result for the branch-table load is very peculiar. To my eyes, it suggests that the branch-table load is incurring cache misses at a furious rate. But I can't honestly think of why that would be. On a Pi 4 everything should fit in L2 comfortably. Are you using a low-performance ARM processor like an A53?
kosolam, 11 months ago
I would love to understand this uxn thing better. Is it some academic experiment, or does it have real-world applications? What's the deal with the additional return stack?
256_, 11 months ago
Why does making the FETCH part of the cycle a macro make it faster? Surely the branch predictor is fine with unconditional immediate branches. What am I missing here?

Also, it has a jump-to-register immediately after the instruction that sets that register. Wouldn't it be faster if it went like:

    get jmp address
    execute VM opcode
    jmp to next instruction

so the pipeline can fetch it ahead of time?
projektfu, 11 months ago
Good article, and it brings the data to back it up.

Unfortunately, it was hard to read with the monokai.css theme because comments were nearly invisible, and a lot of your information was in comments. Changing the color from #75715e to #95917e did the trick. I guess Monokai is for programmers who never read comments.
tempodox, 11 months ago
I find this strange:

    &mut h as *mut _ as *mut _

What is going on here?
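A hedged illustration of the pattern in general (the concrete types here are invented; the article has the real ones): the first `as *mut _` coerces the reference into a raw pointer to `h`'s own type, and the second casts that to whatever pointee type the call site expects, with both types filled in by inference.

    use std::ffi::c_void;

    struct Handler { count: u32 }

    fn main() {
        let mut h = Handler { count: 0 };
        // `&mut h as *mut _ as *mut _`, spelled out in two steps:
        let p: *mut Handler = &mut h;      // step 1: &mut T -> *mut T
        let q: *mut c_void = p as *mut _;  // step 2: *mut T -> *mut U (type-erased)
        // The one-liner lets inference pick both pointer types from context,
        // typically a C FFI parameter:
        let r = &mut h as *mut _ as *mut c_void;
        assert_eq!(q, r);
    }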
0x3444ac53, 11 months ago
I was overjoyed to open this and find a reference to 100r and UXN.
lilyball, 11 months ago
Does the compiler do anything differently if you stick to Rust but change the dispatch implementation to be an #[inline] fn next() that you put at the end of each opcode?
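A sketch of that suggestion with hypothetical names; inlining `next` duplicates the fetch-and-dispatch at every opcode's tail, much like the threaded assembly, though without guaranteed tail calls the nested calls can still deepen the stack:

    struct Vm { pc: usize, code: Vec<u8> }

    type Op = fn(&mut Vm);

    #[inline(always)]
    fn next(vm: &mut Vm) {
        let op = vm.code[vm.pc];
        vm.pc += 1;
        OPS[op as usize](vm); // indirect call; not a guaranteed tail call
    }

    fn op_halt(_vm: &mut Vm) {} // returns without dispatching, unwinding the chain

    fn op_nop(vm: &mut Vm) {
        // ...opcode work...
        next(vm); // dispatch duplicated here once inlined
    }

    static OPS: [Op; 2] = [op_halt, op_nop]; // toy table; a real one has 256 entries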
ajbt200128, 11 months ago
> // SAFETY: do you trust me?

No.

Have you seen the safer-ffi crate? Then you won't have to commit the deadly sin of writing unsafe Rust.