One key metric where Gcc and Clang differ is in their willingness to produce 'cmov', "conditional move", instructions. ('cmov' is an instruction that implements '(a=c?b:a)' without a conditional branch.)<p>Partition is an essential component of numerous key algorithms; almost half of C++ STL algorithms depend on it. A key primitive to implement partition is the 'swap_if' operation. Performance of this operation thus directly determines the performance of many important algorithms. For best performance with word-sized arguments, 'swap_if' must be implemented with a pair of back-to-back 'cmov' instructions.<p>Gcc got its fingers burned a few years ago from emitting 'cmov' in loop contexts where the condition was predictable and the result immediately depended upon in the next iteration, creating dependency chains that slowed execution of some benchmarks by a factor of 2. Now, Gcc will never, under any circumstances, emit two 'cmov' instructions in one basic block.<p>Clang is happy to emit a pair of back-to-back 'cmov' instructions, which enables it to perform partition operations very substantially faster than Gcc. However, it fails to generate 'cmov' instructions in important cases.<p>The fully general expression of 'swap_if' looks like:<p><pre><code> template <typename T>
bool swap_if(
bool c, T& a, T& b)
{
T v[2] = { a, b };
b = v[1-c], a = v[c];
return c;
}
</code></pre>
It may be used in the inner loop of quicksort's partition operations as<p><pre><code> right += swap_if(
*left < pivot,
*left, *right);
</code></pre>
A 'swap_if' implemented with 'cmov' makes quicksort fully twice as fast as, e.g., current 'std::sort'. However, neither Gcc nor Clang recognizes this 'swap_if' as a place to substitute 'cmov' instructions for the pointer/offset accesses. (The latter are slower, probably because they produce more L1 bus traffic.)<p>Clang will happily emit 'cmov' instructions for a specialization like<p><pre><code> template <>
bool swap_if(
bool c, int& a, int& b)
{
int v[2] = { a, b };
a = -c&v[1] | 1-c&v[0];
b = 1-c&v[0] | -c&v[1];
return c;
}
</code></pre>
where Gcc emits exactly the 'and' and 'or' instructions.
(Interestingly, Gcc's code for both definitions, despite failing to use 'cmov', runs quicksort quite a lot faster than std::sort; just not 2x as fast.) Clang will also emit 'cmov' instructions for<p><pre><code> bool swap_if(
bool c, int& a, int& b)
{
int ta = a, tb = b;
a = c ? tb : ta;
b = c ? ta : tb;
return c;
}
</code></pre>
but Gcc will only produce very slow (i.e. badly predicted) branches for this case.<p>If we had 'std::swap_if' in the C++ Standard Library, probably all implementations would produce good code for it.<p>(As an aside, I find it odd that changing the second line of the template 'swap_if' above to<p><pre><code> a = v[c], b = v[!c];
</code></pre>
makes it much, much slower when compiled with Gcc for recent Intel. I would welcome any insight into why this is.)