
Reversing Bits in C

216 points by pvilchez, almost 12 years ago

19 comments

jandrewrogers, almost 12 years ago

This article overlooks a major factor in bit-twiddling performance on modern CPUs: saturation of the execution ports in a CPU core.

An Intel i7 core has six execution ports, three of which are ALUs of various types. Depending on the specific instruction and the dependencies between instructions, the CPU can execute up to 3 simple integer operations *every clock cycle*, mixed with operations like loads and stores at the same time. For most algorithms, particularly those that are not carefully designed, multiple execution ports may sit idle in a given clock cycle. (Hyper-threads work by opportunistically using these unused execution ports.)

Consequently, an algorithm with a few extra operations but more operation-level parallelism will frequently be faster than an equivalent algorithm whose operations are necessarily serialized in the CPU.

Furthermore, the compiler and CPU may have a difficult time discerning when instructions in some algorithms can be executed in parallel across execution ports. Seemingly null changes to the implementation of such algorithms, such as splitting the work across two accumulator variables and combining them at the end (where any normal programmer would just use one variable), can have a large impact on performance. I once *doubled* the performance of a bit-twiddling algorithm simply by using three variables instead of one. The algorithm was identical, but the use of three registers exposed the available parallelism to the CPU.
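The accumulator-splitting effect described above can be sketched in plain C. This is a hypothetical example (not the commenter's actual algorithm): a single accumulator creates one serial dependency chain, while three independent accumulators give the CPU independent chains to issue in parallel across its execution ports.

```c
#include <stdint.h>
#include <stddef.h>

// One accumulator: every addition depends on the previous one.
uint64_t sum_one(const uint8_t *p, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];                    // serial chain through s
    return s;
}

// Three accumulators: three independent chains, combined at the end.
uint64_t sum_three(const uint8_t *p, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
    }
    for (; i < n; i++)                // leftover tail elements
        s0 += p[i];
    return s0 + s1 + s2;
}
```

Both functions return the same result; only the shape of the dependency graph differs, which is what lets the second version keep more ALU ports busy.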
rainforest, almost 12 years ago

The multiplication trick reminds me of this StackOverflow answer[1], where an SMT solver (Z3) is used to derive a mask and multiplier that extract chosen bits from a byte.

[1] http://stackoverflow.com/questions/14547087/extracting-bits-with-a-single-multiplication/14551792#14551792
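The best-known member of this multiply-and-mask family (the constants come from Sean Eron Anderson's Bit Twiddling Hacks page, not from a Z3 derivation) reverses a byte with a single 64-bit multiply, a mask, and a modulus:

```c
#include <stdint.h>

// Reverse the 8 bits of b in three operations: 64-bit multiply, AND, mod.
// The multiply makes five spaced copies of b; the mask picks out one bit
// from each copy; the mod-1023 folds them into reversed order.
uint8_t reverse8(uint8_t b) {
    return (uint8_t)(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023);
}
```

It is compact, but as discussed elsewhere in this thread, the modulus by a non-power-of-two is expensive, which is part of why the lookup-table version wins in the article's benchmarks.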
mgraczyk, almost 12 years ago

The author seems to misunderstand the idea of asymptotic complexity. All of the reversal operations are O(1) because the number of bits being flipped is a constant. If he were concerned with flipping the bits in an arbitrary-precision number, then his different solutions might deserve "Big-O" classifications.

Second point: the reason the original solution is slow is that a mod operation by a number that is not a power of two involves a floating-point divide, or several multiply-accumulates at extended precision. Either of those operations is slower than any of the other methods.
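The cost gap is visible even at the source level. A minimal sketch (hypothetical helper names): a remainder by a power of two reduces to a single AND, while a remainder by 1023 forces the compiler to emit a real division or a magic-number multiply-and-shift sequence.

```c
#include <stdint.h>

// Power-of-two modulus: compilers lower this to a single AND (x & 255).
uint32_t mod_256(uint32_t x)  { return x % 256; }

// 1023 is not a power of two, so this becomes a divide or a
// multiply-by-reciprocal sequence -- much slower than a mask.
uint32_t mod_1023(uint32_t x) { return x % 1023; }
```

The results are identical to the mathematical remainder either way; only the instruction sequences differ.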
cnvogel, almost 12 years ago

Interestingly, while x86-64 does not seem to have a single opcode for reversing the bits in a byte, it does have an instruction to arbitrarily shuffle the 16 bytes in a 128-bit SSE register (PSHUFB). It just blows my mind how much data those SIMD instructions process or move around in relatively few clock cycles.

http://stackoverflow.com/a/9040426

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html (it's on page 1256 of 3251.)
daniel-cussen, almost 12 years ago

In the GA144, lookup tables are pretty painful, so the way I implement reverse there is:

reverse: a! 16 push . 2 dup . . begin +x 2* 2* unext +x 2* a . + nip ;

In Intel x86/64, the fastest way I know of is to use SIMD instructions: break the 64-bit word into 16 nibbles (4-bit pieces) and use PSHUFB to perform a parallel lookup against another 128-bit xmm register. Then you aggregate the nibbles in reverse order, using inclusive-or and variants of the shuffle instruction.
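The 16-entry reversed-nibble table at the heart of the PSHUFB approach can be sketched in portable C. This scalar loop is not the xmm-register implementation — PSHUFB performs all 16 lookups in a single instruction — but it shows exactly what table is being consulted and how the nibbles are re-aggregated in reverse order:

```c
#include <stdint.h>

// rev4[i] is the 4-bit value i with its bits reversed.
// This is the table a PSHUFB-based version would hold in an xmm register.
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

// Reverse all 64 bits: look up each nibble, emit nibbles in reverse order.
uint64_t reverse64(uint64_t x) {
    uint64_t r = 0;
    for (int i = 0; i < 16; i++) {       // 16 nibbles in a 64-bit word
        r = (r << 4) | rev4[x & 0xF];
        x >>= 4;
    }
    return r;
}
```

The SIMD version replaces the loop with one shuffle per nibble half plus a byte-reversing shuffle, which is where the large speedup comes from.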
twoodfin, almost 12 years ago

The article makes the point that the lookup-table version is fast because the table fits in D$, and that if the table were evicted it would be slower. This is true, but the more interesting point is that by loading this table into D$, you're potentially slowing down other operations.

It's an important conundrum of optimization that if you had 20 similarly complex functions on a critical path, implementing and benchmarking each one individually with a lookup table could show excellent performance while global performance is terrible. And worse, it's *uniformly* terrible, with no particular function seeming to consume an inordinate share of the runtime or, for that matter, of the D$ misses.
Scaevolus, almost 12 years ago

I'm glad they noted that the lookup table's speed relies on it being in cache, which most "benchmark magic bit-fiddling operations" posts ignore. (Although it's temporal locality, not cache coherence, that matters here.)
_ihaque, almost 12 years ago

In the same vein, Andrew Dalke wrote an interesting series of blog posts benchmarking different implementations of population count (counting the number of set bits in a word):

http://dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html

http://dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html

http://dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html

The Stanford Bit Hacks page linked in the original article is also very interesting reading for folks into this sort of stuff.
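One classic branch-free population count, the parallel-reduction version popularized by Hacker's Delight (and likely among the variants benchmarked in those posts), sums bits in progressively wider fields:

```c
#include <stdint.h>

// Count set bits by summing in parallel: 2-bit pairs, then nibbles,
// then bytes, then one multiply to add the four byte counts together.
unsigned popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555);                 // per-pair counts
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);  // per-nibble counts
    x = (x + (x >> 4)) & 0x0F0F0F0F;                 // per-byte counts
    return (x * 0x01010101) >> 24;                   // sum of the 4 bytes
}
```

Like the bit-reversal tricks in the article, it trades an obvious 32-iteration loop for a fixed handful of arithmetic operations with no branches and no table.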
fjarlq, almost 12 years ago

A great companion to this sort of thing is the book Hacker's Delight by Henry S. Warren, Jr.:

http://www.hackersdelight.org/

http://www.amazon.com/Hackers-Delight-2nd-Edition-ebook/dp/B009GMUMTM/
kibwen, almost 12 years ago

*"Intel x86/x64 processors don't have this instruction, so this is definitely not a portable solution."*

This stuck out to me. I know that RISC vs. CISC is basically a meaningless distinction nowadays, but I still naively expected x86 to be more or less a strict superset of ARM.
Symmetry, almost 12 years ago

Very interesting, though you shouldn't be surprised by small differences between O(1) and O(N) algorithms when N is only 8.
KaeseEs, almost 12 years ago

Great analysis, although I'm curious how the idea of doing a bunch of 64-bit ops to accomplish byte arithmetic came about in the first place. Was the function in question not written by a firmware guy?
applecore, almost 12 years ago

Interesting. What's the purpose of reversing the bits in a byte?
robomartin, almost 12 years ago

If you've ever dealt with graphics-file manipulation code, chances are you've suffered the pain of changing the endianness of an image file. I never understood why some of these operations are not implemented as machine instructions that can run in one instruction cycle flat. There's nothing to them; I've done exactly that on FPGAs. Yes, they can be a little resource/routing intensive, but not that bad.
mzs, almost 12 years ago

Oh man, this is one of my nits. I've written code like this. For example, in some bit-counting code I have a block comment in front of it all with 57 non-blank lines. I have a copy of Hacker's Delight on my bookshelf, but will the person after me know what that code does and how it works? I really hope there was at least a comment beforehand pointing to one of the hack web pages.
chacham15, almost 12 years ago

The difference here between the obvious method and the best method is 55 ns. Is there a reason this problem deserves so much attention for so little a difference in time? (I realize it's 6.5x more, but unless it sits at the center of some hot loop, the multiplicative factor doesn't really matter.) What use cases are there for this?
munificent, almost 12 years ago

> That's one mathematical operation, but a large number of CPU instructions. CPU instructions are what matter here, though, as we see, not as much as cache coherency.

I thought it was also a single CPU instruction, but multiple *clock cycles*.
barbs, almost 12 years ago

Ack! Light grey on a white background! My eyes!! Seriously, that's really annoying.
duedl0r, almost 12 years ago

Why on earth does this article have so many upvotes? The running-time analysis is completely wrong... O(n) vs. O(1) and such... tss... don't get me started.