
Compiler Optimizations are Awesome

203 points by turingbook · almost 8 years ago

13 comments

jcranmer · almost 8 years ago

Several years ago, I happened on a blog post where someone presented a hand-written, SSE-vectorized N-queens solver as very fast. I managed to write a faster, non-vectorized C solution that was recursive, that the compiler couldn't vectorize, and that was much easier to understand than the vectorized original.

Turns out the main reason the vaunted "overkilled" solution was so abysmally slow was that the author happily used BSF and BTC in the hottest part of the loop... which are actually rather slow instructions, particularly when you're using them to control a branch (compare-and-jump is a fused µop in practice, but BTC-and-jump is not).

The point of this tale is that if you want to wring absolutely the last clock cycle out of a hot path, you usually need good microarchitectural knowledge about which operations are going to be faster and which are not. Sure, you can beat a compiler with hand-written, hand-optimized assembly code most of the time -- but the people who have the skills to write such code are going to be the people working on the compilers.

The tools for optimizing compilers are getting better, and probably faster than we are capable of pumping out performance engineers to hand-craft the inner loops. In the past decade, we've seen polyhedral loop transformations become production-quality. Auto-vectorization is getting better, particularly when user-directed (think #pragma omp simd); I know Intel has been pushing "outer-loop vectorization" very hard in the past few years. The other big fruit on the horizon is superoptimizers: I suspect we'll see superoptimizers shipping in production compilers within a decade or two.
scraft · almost 8 years ago

Is anyone else here in games development? If we are looking to run the game at 60 FPS, we have 16.67 ms per frame to do everything required to run the game. Because of this real-time requirement, a decent amount of profiling is typically done on each game. I typically see the frame time split across a whole array of different sections of the game, i.e.:

- Calculating skeleton animations (updating bone positions, sometimes skinning vertices too)

- Clipping geometry in the scene (finding out what is inside/outside the camera frustum, etc.)

- Processing game logic; things like AI can be quite costly, though much is game dependent

- Walking through all the geometry that needs drawing and issuing draw calls

- Decompressing streaming audio and sending it to a sound driver buffer/queue

- Stepping the physics world (integrating positions/rotations and resolving intersections, etc.)

The difference between a non-optimized and an optimized build is often 5 FPS versus 60 FPS, and optimizing a single hot file or function would not get the game running anywhere near 60 FPS. I think the idea that optimizing compilers aren't required is completely laughable, but then again I only have one perspective from the games development scene -- maybe someone else will reply and say they make AAA games in C/C++ and don't need compiler optimizations :)
lukego · almost 8 years ago

I often think about Proebsting's Law: "Compiler Advances Double Computing Power Every 18 *Years*". Sure, optimizing compilers are nice to have, but maybe their complexity is disproportionate to their benefit?

I love the way Dynamo [1] is able to reproduce many of the benefits with a fraction of the complexity by doing some of the optimizations at runtime with simpler algorithms. Can we use this approach to "garbage collect" some of the complexity embodied in humongous projects like LLVM?

[1] Dynamo: https://people.cs.umass.edu/~emery/classes/cmpsci691s-fall2004/papers/bala00dynamo.pdf
petters · almost 8 years ago
It should be quite easy to see the value of optimizing compilers. Compile your program with optimizations turned off. Now make it as fast as your release build again, while still keeping them off. For much of my code, I think this would take years.
fizixer · almost 8 years ago

Great talk by DJB.

IMO he couldn't give a convincing answer to the guy who asked about the LuaJIT author being out of a job. But there's a clear answer: JIT authors are not out of a job -- not because optimizing compilers are not dead, but because, even though they are writing compilers, their distinguishing ability is producing "pre-compiled" code.

You might say, "well, a JIT author sped up your code's execution, so he/she is writing an optimizing compiler". Well, you have to realize that, traditionally, JIT authors don't just translate the code into object code; they also apply these things called "compiler optimizations". The point is that if they didn't do that, and simply produced a faithful translation of the code, they would still make the code faster because of pre-compilation (and if they enabled the "compiler optimizations", the code wouldn't run significantly faster than the simply pre-compiled code).

Regardless of whether I agree with it or not, "optimizing compilers are dead" is not the same as saying "JIT authors will be out of business". (Even compiler writers won't be out of business.)
CJefferson · almost 8 years ago

Having used some languages with awful compilers, compiler optimisations let me write cleaner code.

In languages with bad optimisers I have to worry about separating code in a hot loop out into a function -- the cost of a function call is too high. This one in particular I find can lead to some horrible code, as functions grow larger and larger and lots of cutting and pasting happens to avoid function-call costs.

On a smaller note, there's making sure I cache the values of function calls that won't change -- when instead I could trust the compiler to know the value won't change and do the caching itself.
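A minimal C sketch of the manual caching CJefferson describes (types and names invented): with a weak optimizer, the accessor call in the loop condition is paid on every iteration, so one hoists it by hand; a good optimizer inlines the accessor and hoists the load itself, making the two versions equivalent.

```c
typedef struct { int len; int data[16]; } Vec;

/* pure accessor: its result cannot change while the loop runs */
static int vec_len(const Vec *v) { return v->len; }

int sum_naive(const Vec *v) {
    int s = 0;
    for (int i = 0; i < vec_len(v); i++)   /* call in the loop condition */
        s += v->data[i];
    return s;
}

int sum_cached(const Vec *v) {
    int s = 0;
    int n = vec_len(v);                    /* manually hoisted once */
    for (int i = 0; i < n; i++)
        s += v->data[i];
    return s;
}
```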
lmm · almost 8 years ago

> If an optimizing compiler can speed up code by, for example, 50%, then suddenly we need to optimize a lot less code by hand.

This doesn't follow at all. If you have one hot loop and a bunch of cold code, and you auto-optimize your code to be a measly factor of 2 faster, you're still going to need to hand-optimize the hot loop, and what the compiler does to the cold code is irrelevant.

> hand-optimized code has higher ongoing maintenance costs than does portable source code; we'd like to avoid it when there's a better way to meet our performance goals.

True, but again, this only applies if you can optimize by enough to make hand-optimization unnecessary.

> we'd also have to throw away many of those 16 GB phones that are cheap and plentiful and fairly useful today.

This part is nonsense. No-one has anything like 16 GB of *code* on their phone.

Optimization could be valuable, but current compilers are too opaque, making optimization too much of a black art. I believe we need to do something along the lines of "turning the database inside out" (https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/): we should turn the compiler inside out, build it as more of a library, give the developer more insight into what's going on, and have a high-level language that lets you understand how it compiles. Interesting, and vaguely along the same lines: https://www.microsoft.com/en-us/research/publication/coq-worlds-best-macro-assembler/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fpeople%2Fnick%2Fcoqasm.pdf
zurn · almost 8 years ago

TL;DR: "Optimizing compilers are still good to have because they are cheaper than the programmer labour needed for hand optimization."

The original DJB presentation, which this is a response to, is very good and interesting.

It would really be nice if the field of compiler engineering started to address the obvious neglected areas, like optimizing memory layout and data types/representations based on partial evaluation / profile feedback.
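A hand-made instance of the memory-layout optimization zurn is asking for (a sketch with invented names): the array-of-structs to struct-of-arrays transformation that compilers today almost never derive on their own.

```c
#include <stddef.h>

#define N 1024

/* AoS: each particle's fields are adjacent, so a loop touching only x
   wastes most of every cache line on y, z and mass. */
typedef struct { float x, y, z, mass; } Particle;

float sum_x_aos(const Particle *p, size_t n) {
    float s = 0;
    for (size_t i = 0; i < n; i++) s += p[i].x;
    return s;
}

/* SoA: all x values are contiguous, so the same loop streams through
   fully-used cache lines and vectorizes trivially. */
typedef struct { float x[N], y[N], z[N], mass[N]; } Particles;

float sum_x_soa(const Particles *p, size_t n) {
    float s = 0;
    for (size_t i = 0; i < n; i++) s += p->x[i];
    return s;
}
```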
fovc · almost 8 years ago

In the linked slides, DJB talks about a language for communicating with the compiler, separating optimizations from specification. This reminded me of VPRI's "meaning separated from optimization" principle [1]. Does anyone know what became of that line of thinking? Is this idea making its way into Ohm? I remember reading a post/paper about optimizing Nile/Gezira to better exploit the CPU cache (and the struggle to use SIMD), but can't seem to find it now.

[1] http://www.vpri.org/pdf/rn2006002_nsfprop.pdf
jerrre · almost 8 years ago

> Compiler optimization reduces code size

Nope: much is gained by unrolling loops, inlining functions, etc., which all increase code size.

Of course, C++ compilation with no optimization at all can be rather wasteful with performance and code size, but to squeeze out the final bit of performance you probably need to sacrifice code size (whether manually or automatically).
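jerrre's trade-off in source form: the two functions below compute the same sum, but the second is a 4-way unrolled version (roughly the shape an optimizer produces at high optimization levels) with about four times the loop body.

```c
int sum_rolled(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

int sum_unrolled(const int *a, int n) {
    /* four accumulators break the dependency chain between additions */
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {       /* 4-way unrolled body */
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];     /* remainder loop: more code again */
    return s0 + s1 + s2 + s3;
}
```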
nullc · almost 8 years ago
I&#x27;m disappointed at the lack of figures.
mrkgnao · almost 8 years ago

I'm posting this as a top-level comment, but it's really a reply to the discussion downthread about compilers being able to work magic if we let them. Better still: why not help them?

Something I took for granted for the longest time about Haskell (which remains the only language I know of with the feature) is the ability to write user-defined "rewrite rules". You can say: "okay, GHC, I know for a fact that if I use these functions in such-and-such a way, you can replace it by *this* instead":

    {-# RULES "foo/bar" forall x. foo x (bar x) = superOptimizedFooBar x #-}

(GHC requires each rule to carry a name, "foo/bar" here, and to bind its free variables with forall.)

A rule like this would be based on the programmer's knowledge of FooBar theory, which tells her that such an equality holds. The compiler hasn't studied lax monoidal FooBaroids and cannot be expected to infer this on its own. :)

Now, anywhere a user of this code writes something like

    foo [1,2,3] (bar [1,2,3])

the compiler will substitute

    superOptimizedFooBar [1,2,3]

in its place. This is a nice way to bring the compiler "closer" to the programmer, and to allow a library author to integrate domain-specific knowledge into the compiler's optimizations.

You can also "specialize" by using faster implementations in certain cases. For example:

    timesFour :: Num a => a -> a
    timesFour a = a + a + a + a

    timesFourInt :: Int -> Int
    timesFourInt x = shiftL x 2   -- Data.Bits; multiplying by 4 is a left shift

    {-# RULES "timesFour/Int" timesFour = timesFourInt :: Int -> Int #-}

If you call timesFour on a Double, it will use addition (ha!), but using it on an Int uses bit-shifting instead, because this rule fires.

High-performance Haskell libraries like vector, bytestring, text, pipes, or conduit *capitalize* on this feature, among other techniques. When compiling code written using libraries like this, this is how it goes:

- rule #1 fires somewhere
- it rewrites the code into something that matches rule #2, "clearing the way" for it to fire
- rule #2 fires
- rule #3 fires
- rule #1 fires again
- rule #4 fires

and so on, triggering a "cascade" of optimizations.

The promise of Haskell is that we already have a "sufficiently smart compiler": *today*, with good libraries, GHC is capable of turning clear, high-level, reusable functional code with chains of function compositions and folds and so on into tight, fast loops.

I must add, though, that getting rewrite rules to fire in cascades to get "mad gainz" requires one to grok how the GHC inliner/specialiser works:

http://mpickering.github.io/posts/2017-03-20-inlining-and-specialisation.html

Data.Vector also utilizes an internal representation that makes fusion explicit and hence predictable (inevitable, even), called a "bundle":

https://www.stackage.org/haddock/lts-8.16/vector-0.11.0.0/Data-Vector-Fusion-Bundle.html

but this relies on rewrite rules too; e.g. the module above contains this rule:

    {-# RULES "zipWithM xs xs [Vector.Stream]"
              forall f xs. zipWithM f xs xs = mapM (\x -> f x x) xs #-}
faragon · almost 8 years ago

Is there any compiler using "machine learning" for SIMD optimization?