科技回声

16 条评论

So it's an issue of the optimizer; as is often the case, it unrolls too aggressively and shoots itself in the foot, all the while missing out on various other opportunities.In my experience, loop unrolling should basically never be done except in extremely degenerate cases; I remember not long ago someone I know who also optimises Asm remarking "it should've died along with the RISC fad". The original goal was to reduce per-iteration overhead associated with checking for end-of-loop, but any superscalar/OoO/speculative processor can "execute past" those instructions anyway; all that unrolling will do is bloat the code and work against caching. Memory bandwidth is often the bottleneck, not the core.

评论 #18575282 未加载

评论 #18575832 未加载

评论 #18576505 未加载

评论 #18575759 未加载

评论 #18575516 未加载

评论 #18576984 未加载

jepler超过 6 年前

You should translate your program to C++ and build with clang ; it turns the loop into a single constant load. <a href="https://godbolt.org/z/slznbU" rel="nofollow">https://godbolt.org/z/slznbU</a>

评论 #18575390 未加载

评论 #18577233 未加载

评论 #18577363 未加载

评论 #18575132 未加载

beeforpork超过 6 年前

With all the optimisations being implemented in compilers today, it is impressive to see how this opportunity to optimise is missed. Put differently, compiler writers bother about optimisations that gain 0.1% performance in some special cases, but others that could gain 20% performance are not implemented.Why? Is this optimisation particularly difficult to implement? Or is it just missed low-hanging fruit? It sure looks easy (like: rearrange expressions to keep the expression tree shallow and left-branching to avoid stack operations).

评论 #18577408 未加载

评论 #18578538 未加载

评论 #18577333 未加载

评论 #18577083 未加载

techopoly超过 6 年前

That just might be the most dedicated answer I've ever seen on Stack Overflow.

评论 #18575091 未加载

评论 #18576225 未加载

评论 #18576214 未加载

dreamcompiler超过 6 年前

I thought at first this was because integer squaring is potentially faster than general integer multiplication and the compiler wasn't seeing the square operation in the second case, but that's not the explanation here.

评论 #18575575 未加载

ww520超过 6 年前

I'm surprised it's not doing a left shift for the x2.

评论 #18574985 未加载

评论 #18574991 未加载

podsnap超过 6 年前

The graal behavior is a lot more sane:<pre><code> graal: [info] SoFlow.square_i_two 10000 avgt 10 5338.492 ± 36.624 ns/op // 2 *\sum i * i [info] SoFlow.two_i_ 10000 avgt 10 6421.343 ± 34.836 ns/op // \sum 2 * i * i [info] SoFlow.two_square_i 10000 avgt 10 6367.139 ± 34.575 ns/op // \sum 2 * (i * i) regular 1.8: [info] SoFlow.square_i_two 10000 avgt 10 6393.422 ± 27.679 ns/op [info] SoFlow.two_i_ 10000 avgt 10 8870.908 ± 35.715 ns/op [info] SoFlow.two_square_i 10000 avgt 10 6221.205 ± 42.408 ns/op </code></pre> The graal-generated assembly for the first two cases is nearly identical, featuring unrolled repetitions of sequences like<pre><code> [info] 0x000000011433ec03: mov %r8d,%ecx [info] 0x000000011433ec06: shl %ecx ;*imul {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_two_i_@15 (line 41) [info] 0x000000011433ec08: imul %r8d,%ecx ;*imul {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_two_i_@17 (line 41) [info] 0x000000011433ec0c: add %ecx,%r9d ;*iadd {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_two_i_@18 (line 41) [info] 0x000000011433ec0f: lea 0x5(%r11),%r8d ;*iinc {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_two_i_@20 (line 40) </code></pre> while the third case does a single shl at the end.<pre><code> [info] 0x000000010e2918bb: imul %r8d,%r8d ;*imul {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_square_i_two@15 (line 32) [info] 0x000000010e2918bf: add %r8d,%ecx ;*iadd {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_square_i_two@16 (line 32) [info] 0x000000010e2918c2: lea 0x3(%r11),%r8d ;*iinc {reexecute=0 rethrow=0 return_oop=0} [info] ; - add.SoFlow::test_square_i_two@18 (line 31) </code></pre> Both graal and C2 inline, but as usual the graal output is a lot more comprehensible.

bnegreve超过 6 年前

I don't see how generating different code for the same mathematical expression can be a good thing.The compiler should detect that the two expressions are strictly equivalent and generate whatever code it believes is the fastest.Any idea why it is this way?

评论 #18576431 未加载

评论 #18576659 未加载

评论 #18576295 未加载

crb002超过 6 年前

TIL about printing ASM from debug JVMs.

评论 #18575818 未加载

alkonaut超过 6 年前

Is Overflow UB so the compiler can choose to ignore the fact that 2x(i x i) could overflow differently from 2 x i x i?I’m not sure it does overflow differently but I would expect overflow to behave consistently as written, and not be dependent on optimization, is that not the case?

评论 #18578844 未加载

Koshkin超过 6 年前

At first, I thought it was because i * i == -1.

microcolonel超过 6 年前

I guess they do not use value numbering, which is typically how you get equivalent results for cases like this.

qwerty456127超过 6 年前

IMHO some kind of logic preprocesor should take care of this before the actual compilation.

评论 #18578034 未加载

polskibus超过 6 年前

I wonder if the same applies to .net (fx/core).

评论 #18578067 未加载

networkimprov超过 6 年前

Has anyone tried this with Go?

评论 #18576283 未加载

评论 #18576842 未加载

JohnL4超过 6 年前

The database is fast enough for a few extra trips to it, so this is definitely what we should be focusing on.(My cup of bitterness doth overflow.)

16 条评论

userbinator超过 6 年前

评论 #18575282 未加载

评论 #18575832 未加载

评论 #18576505 未加载

评论 #18575759 未加载

评论 #18575516 未加载

评论 #18576984 未加载

jepler超过 6 年前

You should translate your program to C++ and build with clang ; it turns the loop into a single constant load. <a href="https://godbolt.org/z/slznbU" rel="nofollow">https://godbolt.org/z/slznbU</a>

评论 #18575390 未加载

评论 #18577233 未加载

评论 #18577363 未加载

评论 #18575132 未加载

beeforpork超过 6 年前

评论 #18577408 未加载

评论 #18578538 未加载

评论 #18577333 未加载

评论 #18577083 未加载

techopoly超过 6 年前

That just might be the most dedicated answer I've ever seen on Stack Overflow.

评论 #18575091 未加载

评论 #18576225 未加载

评论 #18576214 未加载

dreamcompiler超过 6 年前

评论 #18575575 未加载

ww520超过 6 年前

I'm surprised it's not doing a left shift for the x2.

评论 #18574985 未加载

评论 #18574991 未加载

podsnap超过 6 年前

bnegreve超过 6 年前

评论 #18576431 未加载

评论 #18576659 未加载

评论 #18576295 未加载

crb002超过 6 年前

TIL about printing ASM from debug JVMs.

评论 #18575818 未加载

alkonaut超过 6 年前

评论 #18578844 未加载

Koshkin超过 6 年前

At first, I thought it was because i * i == -1.

microcolonel超过 6 年前

I guess they do not use value numbering, which is typically how you get equivalent results for cases like this.

qwerty456127超过 6 年前

IMHO some kind of logic preprocesor should take care of this before the actual compilation.

评论 #18578034 未加载

polskibus超过 6 年前

I wonder if the same applies to .net (fx/core).

评论 #18578067 未加载

networkimprov超过 6 年前

Has anyone tried this with Go?

评论 #18576283 未加载

评论 #18576842 未加载

JohnL4超过 6 年前

The database is fast enough for a few extra trips to it, so this is definitely what we should be focusing on.(My cup of bitterness doth overflow.)

Why is 2 * (i * i) faster than 2 * i * i in Java?

16 条评论

Why is 2 * (i * i) faster than 2 * i * i in Java?

16 条评论