Back in the mid-2000s I worked at a company that made their own (MIPS-based) chips. NSA was one of our customers - supposedly the "defense" who could be considered the good side of NSA compared to the 10x larger "offense" but still. As we were planning for our second generation, they offered quite a bit of money if we'd implement a "sheep and goats" instruction. It would take two operands: an input and a mask. The masked-in bits of the input (the "sheep") would be packed toward the MSB of the output, while the masked-out bits (the "goats") would be packed toward the LSB. We had a lot of people on staff with serious chops in all sorts of math including cryptography, but none of them could identify an algorithm that would benefit from having such an instruction (as distinct from more conventional range-based bitfield instructions). Since the company went under shortly afterward, it remained a mystery. I still wonder about it.
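(For what it's worth, here is a hedged sketch of the operation they describe, built from today's BMI2 PEXT plus a popcount; the function name and the empty-mask guard are my own, and this only illustrates the semantics, not whatever algorithm the NSA had in mind. The builtins are GCC/Clang.)<p><pre><code>    #include <stdint.h>
    #include <immintrin.h>                        // _pext_u64 (BMI2); compile with -mbmi2

    // "Sheep and goats": masked-in bits packed toward the MSB,
    // masked-out bits packed toward the LSB.
    static uint64_t sheep_and_goats(uint64_t x, uint64_t mask) {
        uint64_t sheep = _pext_u64(x, mask);      // masked-in bits, packed at the low end
        uint64_t goats = _pext_u64(x, ~mask);     // masked-out bits, packed at the low end
        int k = __builtin_popcountll(mask);       // number of sheep
        if (k == 0) return goats;                 // avoid an undefined 64-bit shift
        return (sheep << (64 - k)) | goats;       // push the sheep up against the MSB
    }
</code></pre>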
Some years back, I got myself a copy of Andrew Hodges' "Alan Turing: The Enigma", a biography and IMO generally a good read, but also one with some gems of very early computing history in it.<p>Specifically, after WWII, Turing worked on the ACE (later reduced to the Pilot ACE) project to build an electronic computer, which didn't really progress due to management and bureaucracy overhead. He eventually went to Manchester once they got their Manchester Mark 1 off the ground, which they tried to commercialize as the "Ferranti Mark 1" (<a href="https://en.wikipedia.org/wiki/Ferranti_Mark_1" rel="nofollow">https://en.wikipedia.org/wiki/Ferranti_Mark_1</a>).<p>While employed by the University, Turing IIRC continued to work on the side as an external consultant for whatever became of G.C. & C.S. According to the book, he convinced them to buy such a machine (presumably for cryptanalysis?) and, on the Manchester side of things, insisted on some modifications, including a "horizontal adder" so it could count the number of bits set in a word with a single instruction, i.e. a popcount instruction. This would pre-date the IBM Stretch mentioned in the article.
The consensus on the 1992 thread (including a really great comment from 'Animats) seems to be that `popcount` was generally not added to architectures at NSA's request --- that people familiar with those archs knew the actual reason `popcount` wound up in the ISA, and it preceded NSA purchases.<p><a href="https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC7Xhr8J" rel="nofollow">https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC...</a>
Counting bits was the bottleneck in the genomic scan I co-authored (Kanoungi et al. 2020). popcnt resulted in insane performance gains compared to all other methods.<p>However, we re-discovered the fact that some Intel CPUs, including the Nehalem mentioned in the article, have a bug that severely affects popcnt's performance; see for example here: <a href="https://github.com/komrad36/LATCH/issues/3#issuecomment-267132818" rel="nofollow">https://github.com/komrad36/LATCH/issues/3#issuecomment-2671...</a>
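(For reference, and not the code from the paper: the usual mitigation is to accumulate into several independent counters so back-to-back popcounts don't serialize on one destination register. A minimal sketch with GCC/Clang builtins, assuming the data sits in an array of 64-bit words:)<p><pre><code>    #include <stddef.h>
    #include <stdint.h>

    // Sum of popcounts over a buffer, using four independent accumulators so
    // consecutive POPCNTs don't all chain through the same destination register.
    uint64_t count_bits(const uint64_t *data, size_t n) {
        uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            c0 += (uint64_t)__builtin_popcountll(data[i + 0]);
            c1 += (uint64_t)__builtin_popcountll(data[i + 1]);
            c2 += (uint64_t)__builtin_popcountll(data[i + 2]);
            c3 += (uint64_t)__builtin_popcountll(data[i + 3]);
        }
        for (; i < n; ++i) c0 += (uint64_t)__builtin_popcountll(data[i]);
        return c0 + c1 + c2 + c3;
    }
</code></pre>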
It is possible that the "population count" instruction was included in the instruction sets of most American supercomputers at the request of the NSA, which was an important customer for them.<p>Nevertheless, the first computer to have this instruction was a British one, the Ferranti Mark I (February 1951).<p>The Ferranti Mark I's name for this instruction was "sideways add".<p>Also notable: the Ferranti Mark I had the equivalent of LZCNT (count leading zeros) too.<p>Both instructions are very useful and are standard in modern instruction sets, but after the Ferranti Mark I they were omitted from most computers, except expensive supercomputers.
Obviously using a dedicated instruction is fastest in normal cases.<p>But if you need to implement popcount or many other bit-manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr., 2003.<p>"Hacker's Delight", page 65 and onward, discusses "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.<p>One approach is to set each 2-bit field to the count of 1-bits in its two 1-bit fields, then each 4-bit field to the sum of its two 2-bit fields, etc., like this:<p><pre><code> x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
x = (x & 0x0000ffff) + (x >> 16);
</code></pre>
assuming x is 32 bits.<p>I think this approach is a classic divide-and-conquer solution.
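As a self-contained version of the same divide-and-conquer idea (my own wrapping of the snippet above, not copied from the book):<p><pre><code>    #include <stdint.h>
    #include <stdio.h>

    // Each step sums adjacent fields of twice the width
    // (2-bit, 4-bit, 8-bit, 16-bit, then 32-bit).
    static uint32_t popcount32(uint32_t x) {
        x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x & 0x0f0f0f0fu) + ((x >> 4) & 0x0f0f0f0fu);
        x = (x & 0x00ff00ffu) + ((x >> 8) & 0x00ff00ffu);
        x = (x & 0x0000ffffu) + (x >> 16);
        return x;
    }

    int main(void) {
        printf("%u\n", popcount32(0xdeadbeefu));  // prints 24
        return 0;
    }
</code></pre>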
GPU programmers use popcount-based programming all the time these days, but the abstractions are built on top of it and are hardware accelerated.<p>CUDA's __activemask() returns the 32-bit value of your warp's current 32-wide EXEC mask. That is to say, if your current warp runs:<p><pre><code>    int foo = 0;
    if (threadIdx.x % 2) {
        foo = __activemask();
    }
</code></pre>
foo will be "0b01010101...." or 0x55555555. This __activemask() has a number of useful properties should you use __popc with it.<p>popcount(__activemask()); returns the number of threads executing.<p>lanemask_lt() returns "0b0000000000000001" for the 0th lane. 0b0000000000000011 for the 1st lane. 0b0000000000000111... for the 2nd lane... and 111111111...111 for the last 31st lane.<p>popcount(__activemask() & lanemask_lt()); returns the "active lane count". All together now, we can make a parallel SIMD-stack that can push/pop together in parallel.<p><pre><code> int head = 0;
    char buffer[0x1000];
    while (fooBar()) {       // Dynamic! We don't know who is, or is not, active anymore
        unsigned active = __activemask();
        // __lanemask_lt() is a helper exposing the %lanemask_lt PTX special register
        int localPrefix = __popc(active & __lanemask_lt());   // my rank among the active lanes
        int totalWarpActive = __popc(active);                 // how many lanes are active
        buffer[head + localPrefix] = generateValueThisThread();
        if (localPrefix == 0) {
            head += totalWarpActive;   // Move the head forward, much like a "push" operation in single-thread land
                                       // Only one thread (the first active lane) moves the head
        }
        __syncthreads();   // Thread barrier: make sure everyone sees the new head before continuing
    }
</code></pre>
------------<p>As such, you can dynamically load-balance between GPU threads (!!!) from a shared stack with minimal overhead.<p>If you want to extend this beyond one 32-wide CUDA warp, you'll need to use __shared__ memory to share the prefix with the rest of the block.<p>It is a bad idea (too much overhead) to extend this much beyond a block, as there's no quick way to communicate outside of your block. Still, having chunks of up to 1024 threads synchronized through a shared data structure with only nanoseconds of overhead is a nifty trick.<p>-----------<p>EDIT: Oh right, and this concept is now replicated very, very quickly by the dedicated __ballot_sync(...) function (which compiles down to just a few assembly instructions).<p>Playing with the EXEC mask is a hugely efficient way to synchronously and dynamically gather information across your warp, so lots of little tricks have been built around this.
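A hedged sketch of that last point, using only documented intrinsics (__ballot_sync, __popc); the helper name, the full-warp mask, and the 1-D block assumption are mine:<p><pre><code>    // Bit i of the ballot is set iff lane i passed the predicate, so a popcount
    // gives the participant count and a masked popcount gives each lane's rank.
    __device__ int warp_rank(bool participating, int *count_out) {
        unsigned ballot = __ballot_sync(0xffffffffu, participating); // every lane must reach this call
        unsigned lane   = threadIdx.x & 31;                          // lane id within the warp
        unsigned below  = (1u << lane) - 1u;                         // same value as %lanemask_lt
        *count_out = __popc(ballot);                                 // how many lanes participate
        return __popc(ballot & below);                               // my rank among them
    }
</code></pre>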
Another interesting application of popcount is in computer vision, namely in matching keypoints that use binary descriptors for 3D reconstruction in SLAM, TRN, etc.
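Concretely, matching binary descriptors (e.g. 256-bit BRIEF/ORB-style ones) boils down to a Hamming distance, which is just XOR plus popcount. A minimal sketch with a GCC/Clang builtin, assuming a 4x64-bit descriptor layout:<p><pre><code>    #include <stdint.h>

    // Hamming distance between two 256-bit binary descriptors,
    // each stored as four 64-bit words.
    static int hamming256(const uint64_t a[4], const uint64_t b[4]) {
        int d = 0;
        for (int i = 0; i < 4; ++i)
            d += __builtin_popcountll(a[i] ^ b[i]);  // count differing bits per word
        return d;
    }
</code></pre>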
Discussed at the time: <a href="https://news.ycombinator.com/item?id=20914479" rel="nofollow">https://news.ycombinator.com/item?id=20914479</a>
It is appalling that, after <i>every</i> other general-computing architecture in common use either started out with a popcount instruction, or had one added later at substantial expense, RISC-V came out without one.<p>It still doesn't have any. The proposed B, "bitmanip" extension has it (along with a raft of trivial variations: count leading zeroes, count trailing ones, yada yada) but that is not ratified and not implemented in any chip I know of. Since B is a huge extension, we can expect it will be routinely omitted even after it's ratified, and compilers will need special prodding to produce any such instructions.<p>It should have been in the base instruction set. We probably can blame its lack on the academic origins of the design. CS professors probably think of it as a thing not needed to implement Lisp, therefore not worth class time.<p>(Some people say, "Oh, but you can trap and emulate it", which adds insult to injury. Trapping and emulating eliminates all the value the instruction offers.)
My first thought "How else do you quickly count pieces on a bitboard?". Definitely chess programming caused me to never second guess the usefulness of `popcount`
Here's a dumb question. If someone asked me to do it I'd probably write code like:<p><pre><code>    while (x != 0) {
        c += x & 1;
        x >>= 1;
    }
</code></pre>
Is this something that should be added to LLVM?<p>Edit: flip the order
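(For reference, the same loop wrapped as a compilable function, which is my own framing; with GCC/Clang, __builtin_popcount is the portable way to get the hardware instruction when the target has one:)<p><pre><code>    #include <stdint.h>

    static int popcount_naive(uint32_t x) {
        int c = 0;
        while (x != 0) {
            c += x & 1;   // add the lowest bit
            x >>= 1;      // then shift it out
        }
        return c;
    }
    // __builtin_popcount(x) emits POPCNT when the target supports it,
    // with a software fallback otherwise.
</code></pre>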