Back in the mid-2000s I worked at a company that made their own (MIPS-based) chips. NSA was one of our customers - supposedly the "defense" who could be considered the good side of NSA compared to the 10x larger "offense" but still. As we were planning for our second generation, they offered quite a bit of money if we'd implement a "sheep and goats" instruction. It would take two operands: an input and a mask. The masked-in bits of the input (the "sheep") would be packed toward the MSB of the output, while the masked-out bits (the "goats") would be packed toward the LSB. We had a lot of people on staff with serious chops in all sorts of math including cryptography, but none of them could identify an algorithm that would benefit from having such an instruction (as distinct from more conventional range-based bitfield instructions). Since the company went under shortly afterward, it remained a mystery. I still wonder about it.
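(For what it's worth, here is a hedged sketch of the operation they describe, built from today's BMI2 PEXT plus a popcount; the function name and the empty-mask guard are my own, and this only illustrates the semantics, not whatever algorithm the NSA had in mind. The builtins are GCC/Clang.)<p><pre><code>    #include <stdint.h>
    #include <immintrin.h>                        // _pext_u64 (BMI2); compile with -mbmi2

    // "Sheep and goats": masked-in bits packed toward the MSB,
    // masked-out bits packed toward the LSB.
    static uint64_t sheep_and_goats(uint64_t x, uint64_t mask) {
        uint64_t sheep = _pext_u64(x, mask);      // masked-in bits, packed at the low end
        uint64_t goats = _pext_u64(x, ~mask);     // masked-out bits, packed at the low end
        int k = __builtin_popcountll(mask);       // number of sheep
        if (k == 0) return goats;                 // avoid an undefined 64-bit shift
        return (sheep << (64 - k)) | goats;       // push the sheep up against the MSB
    }
</code></pre>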
Some years back, I got myself a copy of Andrew Hodges' "Alan Turing: The Enigma", a biography and IMO generally a good read, but also one with some gems of very early computing history in it.<p>Specifically, after WWII, Turing worked on the ACE (later reduced to the Pilot ACE) project to build an electronic computer, which didn't really progress due to management and bureaucracy overhead. He eventually went to Manchester once they got their Manchester Mark 1 off the ground, which they tried to commercialize as the "Ferranti Mark 1" (<a href="https://en.wikipedia.org/wiki/Ferranti_Mark_1" rel="nofollow">https://en.wikipedia.org/wiki/Ferranti_Mark_1</a>).<p>While employed by the University, Turing IIRC continued to work on the side as an external consultant for whatever became of G.C. & C.S. According to the book, he convinced them to buy such a machine (presumably for cryptanalysis?) and, on the Manchester side of things, insisted on some modifications, including a "horizontal adder" so it could count the number of bits set in a word with a single instruction, i.e. a popcount instruction. This would pre-date the IBM Stretch mentioned in the article.
The consensus on the 1992 thread (including a really great comment from 'Animats) seems to be that `popcount` was generally not added to architectures at NSA's request --- that people familiar with those archs knew the actual reason `popcount` wound up in the ISA, and it preceded NSA purchases.<p><a href="https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC7Xhr8J" rel="nofollow">https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC...</a>
Counting bits was the bottleneck in the genomic scan I co-authored (Kanoungi et al. 2020). popcnt resulted in insane performance gains compared to all other methods.<p>However, we re-discovered the fact that some Intel CPUs, including the Nehalem mentioned in the article, have a bug that severely affects popcnt's performance; see for example here: <a href="https://github.com/komrad36/LATCH/issues/3#issuecomment-267132818" rel="nofollow">https://github.com/komrad36/LATCH/issues/3#issuecomment-2671...</a>
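(For reference, and not the code from the paper: the usual mitigation is to accumulate into several independent counters so back-to-back popcounts don't serialize on one destination register. A minimal sketch with GCC/Clang builtins, assuming the data sits in an array of 64-bit words:)<p><pre><code>    #include <stddef.h>
    #include <stdint.h>

    // Sum of popcounts over a buffer, using four independent accumulators so
    // consecutive POPCNTs don't all chain through the same destination register.
    uint64_t count_bits(const uint64_t *data, size_t n) {
        uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            c0 += (uint64_t)__builtin_popcountll(data[i + 0]);
            c1 += (uint64_t)__builtin_popcountll(data[i + 1]);
            c2 += (uint64_t)__builtin_popcountll(data[i + 2]);
            c3 += (uint64_t)__builtin_popcountll(data[i + 3]);
        }
        for (; i < n; ++i) c0 += (uint64_t)__builtin_popcountll(data[i]);
        return c0 + c1 + c2 + c3;
    }
</code></pre>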
It is possible that the "population count" instruction was included in the instruction sets of most American supercomputers at the request of the NSA, which was an important customer for them.<p>Nevertheless, the first computer to have this instruction was a British one, the Ferranti Mark I (February 1951).<p>The Ferranti Mark I's name for this instruction was "sideways add".<p>Also notable: the Ferranti Mark I had the equivalent of LZCNT (count leading zeros) too.<p>Both instructions are very useful and are standard in modern instruction sets, but after the Ferranti Mark I they were omitted from most computers, except expensive supercomputers.
Obviously using a dedicated instruction is fastest in normal cases.<p>But if you need to implement popcount or many other bit-manipulation algorithms in software, a good book to look at is "Hacker's Delight" by Henry S. Warren, Jr., 2003.<p>"Hacker's Delight", page 65 and onward, discusses "Counting 1-bits" (population counts). There are a lot of software algorithms to do this.<p>One approach is to set each 2-bit field to the count of 1-bits in its two 1-bit fields, then each 4-bit field to the sum of its two 2-bit fields, etc., like this:<p><pre><code> x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
x = (x & 0x0000ffff) + (x >> 16);
</code></pre>
assuming x is 32 bits.<p>I think this approach is a classic divide-and-conquer solution.
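As a self-contained version of the same divide-and-conquer idea (my own wrapping of the snippet above, not copied from the book):<p><pre><code>    #include <stdint.h>
    #include <stdio.h>

    // Each step sums adjacent fields of twice the width
    // (2-bit, 4-bit, 8-bit, 16-bit, then 32-bit).
    static uint32_t popcount32(uint32_t x) {
        x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
        x = (x & 0x0f0f0f0fu) + ((x >> 4) & 0x0f0f0f0fu);
        x = (x & 0x00ff00ffu) + ((x >> 8) & 0x00ff00ffu);
        x = (x & 0x0000ffffu) + (x >> 16);
        return x;
    }

    int main(void) {
        printf("%u\n", popcount32(0xdeadbeefu));  // prints 24
        return 0;
    }
</code></pre>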
GPU programmers use popcount-based programming all the time these days, but the abstractions are built on top of it and are hardware accelerated.<p>CUDA's __activemask() returns the 32-bit value of your warp's current 32-wide EXEC mask. That is to say, if your current warp runs:<p><pre><code>    int foo = 0;
    if (threadIdx.x % 2) {
        foo = __activemask();
    }
</code></pre>
foo will be "0b01010101...." or 0x55555555. This __activemask() has a number of useful properties should you use __popc with it.<p>popcount(__activemask()); returns the number of threads executing.<p>lanemask_lt() returns "0b0000000000000001" for the 0th lane. 0b0000000000000011 for the 1st lane. 0b0000000000000111... for the 2nd lane... and 111111111...111 for the last 31st lane.<p>popcount(__activemask() & lanemask_lt()); returns the "active lane count". All together now, we can make a parallel SIMD-stack that can push/pop together in parallel.<p><pre><code> int head = 0;
    char buffer[0x1000];
    while (fooBar()) {       // Dynamic! We don't know who is, or is not, active anymore
        unsigned active = __activemask();
        // __lanemask_lt() is a helper exposing the %lanemask_lt PTX special register
        int localPrefix = __popc(active & __lanemask_lt());   // my rank among the active lanes
        int totalWarpActive = __popc(active);                 // how many lanes are active
        buffer[head + localPrefix] = generateValueThisThread();
        if (localPrefix == 0) {
            head += totalWarpActive;   // Move the head forward, much like a "push" operation in single-thread land
                                       // Only one thread (the first active lane) moves the head
        }
        __syncthreads();   // Thread barrier: make sure everyone sees the new head before continuing
    }
</code></pre>
------------<p>As such, you can dynamically load-balance between GPU threads (!!!) from a shared stack with minimal overhead.<p>If you want to extend this beyond one 32-wide CUDA warp, you'll need to use __shared__ memory to share the prefix with the rest of the block.<p>It is a bad idea (too much overhead) to extend this much beyond a block, as there's no quick way to communicate outside of your block. Still, having chunks of up to 1024 threads synchronized through a shared data structure with only nanoseconds of overhead is a nifty trick.<p>-----------<p>EDIT: Oh right, and this concept is now replicated very, very quickly by the dedicated __ballot_sync(...) function (which compiles down to just a few assembly instructions).<p>Playing with the EXEC mask is a hugely efficient way to synchronously and dynamically gather information across your warp, so lots of little tricks have been built around this.
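A hedged sketch of that last point, using only documented intrinsics (__ballot_sync, __popc); the helper name, the full-warp mask, and the 1-D block assumption are mine:<p><pre><code>    // Bit i of the ballot is set iff lane i passed the predicate, so a popcount
    // gives the participant count and a masked popcount gives each lane's rank.
    __device__ int warp_rank(bool participating, int *count_out) {
        unsigned ballot = __ballot_sync(0xffffffffu, participating); // every lane must reach this call
        unsigned lane   = threadIdx.x & 31;                          // lane id within the warp
        unsigned below  = (1u << lane) - 1u;                         // same value as %lanemask_lt
        *count_out = __popc(ballot);                                 // how many lanes participate
        return __popc(ballot & below);                               // my rank among them
    }
</code></pre>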
Another interesting application of popcount is in computer vision, namely in matching keypoints that use binary descriptors for 3D reconstruction in SLAM, TRN, etc.
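Concretely, matching binary descriptors (e.g. 256-bit BRIEF/ORB-style ones) boils down to a Hamming distance, which is just XOR plus popcount. A minimal sketch with a GCC/Clang builtin, assuming a 4x64-bit descriptor layout:<p><pre><code>    #include <stdint.h>

    // Hamming distance between two 256-bit binary descriptors,
    // each stored as four 64-bit words.
    static int hamming256(const uint64_t a[4], const uint64_t b[4]) {
        int d = 0;
        for (int i = 0; i < 4; ++i)
            d += __builtin_popcountll(a[i] ^ b[i]);  // count differing bits per word
        return d;
    }
</code></pre>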
Discussed at the time: <a href="https://news.ycombinator.com/item?id=20914479" rel="nofollow">https://news.ycombinator.com/item?id=20914479</a>
It is appalling that, after <i>every</i> other general-computing architecture in common use either started out with a popcount instruction, or had one added later at substantial expense, RISC-V came out without one.<p>It still doesn't have any. The proposed B, "bitmanip" extension has it (along with a raft of trivial variations: count leading zeroes, count trailing ones, yada yada) but that is not ratified and not implemented in any chip I know of. Since B is a huge extension, we can expect it will be routinely omitted even after it's ratified, and compilers will need special prodding to produce any such instructions.<p>It should have been in the base instruction set. We probably can blame its lack on the academic origins of the design. CS professors probably think of it as a thing not needed to implement Lisp, therefore not worth class time.<p>(Some people say, "Oh, but you can trap and emulate it", which adds insult to injury. Trapping and emulating eliminates all the value the instruction offers.)
My first thought "How else do you quickly count pieces on a bitboard?". Definitely chess programming caused me to never second guess the usefulness of `popcount`
Here's a dumb question. If someone asked me to do it I'd probably write code like:<p><pre><code>    while (x != 0) {
        c += x & 1;
        x >>= 1;
    }
</code></pre>
Is this something that should be added to LLVM?<p>Edit: flip the order
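(For reference, the same loop wrapped as a compilable function, which is my own framing; with GCC/Clang, __builtin_popcount is the portable way to get the hardware instruction when the target has one:)<p><pre><code>    #include <stdint.h>

    static int popcount_naive(uint32_t x) {
        int c = 0;
        while (x != 0) {
            c += x & 1;   // add the lowest bit
            x >>= 1;      // then shift it out
        }
        return c;
    }
    // __builtin_popcount(x) emits POPCNT when the target supports it,
    // with a software fallback otherwise.
</code></pre>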