This is a very artificial sort of task. You have to visit all bits down to the least significant 1 bit, and ignore the rest? In designing the system, since the assignment of bits to meaning is probably arbitrary, it is smarter to assign them in whatever way is most convenient and least error-prone to code against, arranging things so the 1 bits can be visited least-significant first (a sketch of that loop appears below, after the code).<p>But, OK.<p>(It is probably important, here, to call more attention to the fact that shifting a <i>signed</i> value left until its sign changes would be Undefined Behavior, making your optimizer likely to do something crazy. That is why the code sample must cast a shifted uint64_t to an int64_t before comparing to zero; that conversion is well-defined. Reader be advised!)<p>In the final, indexing example, the code the compiler generates shifts the word right 63 places to form the array index, instead of checking the sign. You might as well do that in the source code (also sketched below), instead of coding a check of the sign bit that the compiler will convert to what you really meant anyway.<p>The final trick of indexing an array to get the right function to call is a good one, and could be a lot faster than a branch if the bit patterns are close to random. Branch predictors are very good at teasing out patterns, but the penalty for guessing wrong is very high. A possibly better way, particularly on clang, would look more like<p><pre><code> (mask >> 63 ? func_true : func_false)(i);
</code></pre>
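(To make that first point concrete: if the bits were assigned so they could be consumed least-significant first, the loop could use the standard clear-lowest-set-bit idiom. This is only a sketch; handle_bit is a made-up name, and __builtin_ctzll is the GCC/Clang builtin for counting trailing zeros.)<p><pre><code> void loop_lsb_first(uint64_t mask) {
   while (mask) {
     int i = __builtin_ctzll(mask);  // index of the lowest set bit
     handle_bit(i);                  // hypothetical per-bit handler
     mask &= mask - 1;               // clear that bit and continue
   }
 }
</code></pre>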
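(And for contrast, the array-indexing version with the shift written out in the source would presumably look something like this; the table name is my own invention.)<p><pre><code> void loop_indexed(uint64_t mask) {
   static void (*const dispatch[2])(int) = { func_false, func_true };
   for (int i = 63; mask; mask <<= 1, --i)
     dispatch[mask >> 63](i);  // top bit selects which function to call
 }
</code></pre>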
Modern chips have a "cmov" instruction that is likely to be better than indexing into an array, with its three-cycle trip out to L1 cache: on a modern chip, that is enough time to execute 18 other instructions!<p>In the loop, there is no point in comparing i to zero, because you are going to break out when the mask hits zero anyway. So, that is just wasted code; deleting it would eliminate the useless "dec %ebp; js 108" instructions. That branch is reliably predicted, so it mostly just costs cache footprint, not cycles. But the micro-operations cache is very small, and if the extra instructions make you blow your micro-op cache, that could be pretty expensive.<p>Finally, and <i>most important</i>: the number of instructions is an extremely poor way to estimate performance. Often, many instructions take less time to run than a few different ones. So, to make a smart choice about performance, or to present <i>actually meaningful</i> results in your blog post, you need to set up a micro-benchmark and actually see what really happens, not just guess by counting lines (a minimal harness is sketched at the end of this comment). There are extremely good micro-benchmark tools available free online. Google's is as good as any. Or you could paste the code into Godbolt.org and benchmark it there.<p>So, in the end, this <i>might</i> be best:<p><pre><code> void loop_v4(uint64_t mask) {
   for (int i=63; mask; mask <<= 1, --i)
     (mask >> 63 ? func_true : func_false)(i);
 }
</code></pre>
or possibly<p><pre><code> (mask & ~(-1ull>>1) ? func_true : func_false)(i);
</code></pre>
or<p><pre><code> ((int64_t)mask < 0 ? func_true : func_false)(i);
</code></pre>
But again, you need to measure (which I didn't do either). Results might be very, very different between x86 and ARM. Of course the results will depend on what happens in func_true and in func_false, too. And, you might find this is altogether the wrong place to spend your optimizing attention.
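(To make the measuring part easier to start: a minimal harness for Google's benchmark library might look roughly like the sketch below. It assumes loop_v4, func_true, and func_false from above are defined in the same file, and the mask value is just an arbitrary test pattern.)<p><pre><code> #include <benchmark/benchmark.h>
 #include <cstdint>

 static void BM_loop_v4(benchmark::State& state) {
   uint64_t mask = 0xDEADBEEFCAFEBABEull;  // arbitrary bit pattern
   for (auto _ : state) {
     benchmark::DoNotOptimize(mask);       // keep mask from being constant-folded away
     loop_v4(mask);
   }
 }
 BENCHMARK(BM_loop_v4);
 BENCHMARK_MAIN();
</code></pre>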